Spark Model Training with multiple executors #5330
Comments
@Javelinjs Do we have distributed training with Spark?
Yes, we have distributed training on Spark, though more features are needed to bring it to a production environment; see #2268
"Also, remember to set --executor-cores 1 to ensure there's only one task run in one Spark executor." Is this outdated? It seems to imply that training won't work with multiple partitions per executor.
That setting is used to avoid running multiple threads on one MXNet engine, which may cause problems.
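For reference, the single-task-per-executor launch being discussed would look roughly like this (the class name, jar name, and executor count below are placeholders, not from the MXNet docs):

```
# One task per executor so only one thread drives each MXNet engine;
# scale out by adding executors rather than cores per executor.
spark-submit \
  --executor-cores 1 \
  --num-executors 4 \
  --class sample.MxnetTrainingApp \
  my-app.jar
```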
So this is the line of code that brought me to write the ticket. I may be totally misreading it, but it looks like it uses a separate model in each mapPartitions call. Then, once you have an RDD of models, it returns the first model, which is representative of only one partition of data. So it seems like some of the data would not be represented if there were more than one partition. If this is the case, I am a little confused about how this is distributed, since everything would have to be on a single node to work correctly.
Model parameters are stored in the parameter server and pulled by workers during training. This is how workers 'see' each other.
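A minimal sketch of the push/pull pattern described above (ToyKVStore is a made-up stand-in; MXNet's actual mechanism is its KVStore, run in a distributed mode for Spark training):

```python
import numpy as np

class ToyKVStore:
    """Toy stand-in for a parameter server."""
    def __init__(self):
        self.store = {}

    def init(self, key, value):
        self.store[key] = value.copy()

    def push(self, key, grad, lr=0.1):
        # Workers push gradients; the server applies the update.
        self.store[key] -= lr * grad

    def pull(self, key):
        # Workers pull the latest parameters before computing.
        return self.store[key].copy()

kv = ToyKVStore()
kv.init("w", np.array([1.0, 2.0]))
# Two workers push gradients computed on their own data shards;
# both updates land on the shared parameters.
kv.push("w", np.array([0.5, 0.5]))
kv.push("w", np.array([1.0, 1.0]))
print(kv.pull("w"))  # -> [0.85 1.85]
```

Because every worker pulls from and pushes to the same store, each partition's gradients influence the shared parameters even though each partition only sees its own slice of the data.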
This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
This is more of a feature request that I may even be willing to work on in my spare time. However, I first wanted to see if anyone is either working on this or has any specific plans. Basically, right now it seems like you can only use one executor for training an MXNet model.
It would be nice if MXNet could do model parameter averaging across Spark partitions over each iteration of training (or every few iterations). This would require each partition to have access to an MXNet model and apply training on its subset of the data (the partition); after an iteration (or several) completes, we could average the weights and biases into a central model. We would then pass the updated model back to each partition for another iteration or series of iterations, and repeat.
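The averaging step proposed above could look roughly like this on the driver, after collecting one parameter set per partition (average_params and the sample arrays are illustrative only, not part of any MXNet or Spark API):

```python
import numpy as np

def average_params(per_partition_params):
    """Element-wise average of each parameter array across partitions.

    per_partition_params: one entry per partition, each a list of
    ndarrays (weights and biases in a fixed order).
    """
    return [np.mean(np.stack(group), axis=0)
            for group in zip(*per_partition_params)]

# Hypothetical: three partitions, each holding one weight matrix and one bias
part_a = [np.array([[1.0, 2.0]]), np.array([0.0])]
part_b = [np.array([[3.0, 4.0]]), np.array([3.0])]
part_c = [np.array([[5.0, 6.0]]), np.array([6.0])]

averaged = average_params([part_a, part_b, part_c])
print(averaged[0])  # -> [[3. 4.]]
print(averaged[1])  # -> [3.]
```

In a Spark job, the per-partition parameters would come from something like `rdd.mapPartitions(train_on_partition).collect()`, and the averaged result would be broadcast back out for the next round.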
I know there are other deep learning libraries that have something similar, and I wanted to see where you guys are on this.