Model averaging in parameter server#1336
Model averaging in parameter server#1336atefeh-asayesh wants to merge 18 commits intoapache:masterfrom atefeh-asayesh:AverageModel_ParameterServer
Conversation
…to have a model averaging approach in parameter servers.
…to have a model averaging approach in parameter servers.
…to have a model averaging approach in parameter servers.
mboehm7
left a comment
There was a problem hiding this comment.
Thanks for getting started on this optional model averaging feature in the paramserv builtin. Please fix the formatting issues, imports and dependencies, as well as make sure the existing gradient accumulation / tests work without changes, while adding the model averaging. In case of model averaging you can further avoiding accruing the gradients.
| public static final String PS_GRADIENTS = "gradients"; | ||
| public static final String PS_SEED = "seed"; | ||
| public static final String PS_MODELAVG = "modelAvg"; | ||
| public static final String PS_MODELS = "models"; |
There was a problem hiding this comment.
why do we need this besides modelAvg - remove if unnecessary.
There was a problem hiding this comment.
we need this beside modelAvg because it is one of the inputs of paramserver instruction.
|
|
||
| // Push the gradients to ps | ||
| _ps.push(_workerID, gradients); | ||
| //_ps.push(_workerID, modell) |
There was a problem hiding this comment.
what's this? please conditionally on the configuration either push the gradients or model
| } | ||
| } | ||
| } | ||
| //**************************************** ATEFEH ********************************************************************* |
There was a problem hiding this comment.
we do not use author tags - so please remove such comments with your name.
|
|
||
| public LocalPSWorker(int workerID, String updFunc, Statement.PSFrequency freq, | ||
| int epochs, long batchSize, ExecutionContext ec, ParamServer ps) | ||
| int epochs, long batchSize, ExecutionContext ec, ParamServer ps,boolean modelavg) |
There was a problem hiding this comment.
please, avoid corrupting the existing formatting.
| if (_modelAvg){ | ||
| computeBatch_Avg(dataSize,batchIter); | ||
| } | ||
|
|
||
| else | ||
| computeBatch(dataSize, batchIter); |
There was a problem hiding this comment.
the formatting seems off again.
| matrix[double] X_val, matrix[double] y_val, int num_workers, int epochs, | ||
| string utype, string freq, int batch_size, string scheme, string runtime_balancing, | ||
| string weighting, double eta, int C, int Hin, int Win, int seed = -1) | ||
| string weighting, double eta, int C, int Hin, int Win, int seed = -1,boolean modelAvg) |
| string weighting, double eta, int C, int Hin, int Win, int seed = -1,boolean modelAvg) | ||
| return (list[unknown] model) | ||
| { | ||
|
|
| matrix[double] X_val, matrix[double] y_val, | ||
| int epochs, int batch_size, double eta, | ||
| int seed = -1) | ||
| int seed = -1 , boolean modelAvg ) |
| matrix[double] X_val, matrix[double] y_val, | ||
| int num_workers, int epochs, string utype, string freq, int batch_size, string scheme, string runtime_balancing, string weighting, | ||
| double eta, int seed = -1) | ||
| double eta, int seed = -1,boolean modelAvg) |
| val="./src/test/scripts/functions/federated/paramserv/TwoNN.dml::validate", | ||
| k=num_workers, utype=utype, freq=freq, epochs=epochs, batchsize=batch_size, | ||
| scheme=scheme, runtime_balancing=runtime_balancing, weighting=weighting, hyperparams=hyperparams, seed=seed) | ||
| scheme=scheme, runtime_balancing=runtime_balancing, weighting=weighting, hyperparams=hyperparams, seed=seed,modelAvg=modelAvg) |
|
@atefeh-asayesh for the future, please don't comment on every one of these comments (especially if they're repetitive) - just add the commits with the fixes and let me know once you're ready with all of them and I merge it in after another round of review. |
…een fixed. Moreover one new parameter has been added to Paramserver instruction. this parameter is used when the "freq=NBATCHES". we implement computnbatch function for both federated and local paramserver. also some points should be considered in some classes: 1-(ParamservRecompilationTest) in this class test dml file is "paramserv-large-parallelism" in whitch the parameter "freq" in not the main parameter. "nBatches" parameter is consideed as 1 2-ParamservRuntimeNegativeTest In this class test dml file is "paramserv-large-parallelism" in whitch the parameter "freq" in not the main parameter. "nBatches" parameter is consideed as 1 in localPSWorker for two types of BATCH and EPOCH we check the modelAvg parameter. if the modelAvg is true then we do the averaging model method.
|
Thank you very much for your constructive and useful comments. The formatting issue has been completely resolved. There will be no formatting issues in future requests. Comments about unnecessary imports have also been checked and resolved. Bests |
…een fixed. Moreover one new parameter has been added to Paramserver instruction. this parameter is used when the "freq=NBATCHES". we implement computnbatch function for both federated and local paramserver. also some points should be considered in some classes: 1-(ParamservRecompilationTest) in this class test dml file is "paramserv-large-parallelism" in whitch the parameter "freq" in not the main parameter. "nBatches" parameter is consideed as 1 2-ParamservRuntimeNegativeTest In this class test dml file is "paramserv-large-parallelism" in whitch the parameter "freq" in not the main parameter. "nBatches" parameter is consideed as 1 in localPSWorker for two types of BATCH and EPOCH we check the modelAvg parameter. if the modelAvg is true then we do the averaging model method.
|
Unfortunately, there are too many problems with this PR (I spent two hours fixing the remaining issues but there is no end in sight). Please split this PR into one new PR only for model averaging (and later, once the model averaging is really working and cleaned up, one PR for nbatches because otherwise all fixes have to be applied in even more locations). In this process, please revert all changes of existing tests and formatting changes, only add targeted new tests for your new features, and try to limit code changes to what is really needed for integrating model averaging as an optional feature. The best path forward is to keep this PR open, and cherry pick code changes into new PRs and fix the issues. Note that Statement.PS_MODELS is unnecessary and should be removed, the gradient accumulation should only be done if no model averaging is used, the division of model averaging did not take affect (because it was not applied in place and new blocks got discarded in the lambda function), computeBatch always pushed the gradients, the federated control thread (per worker) always pushed the gradients. Also try to keep the core code path as simple as possible: e.g., you can do something like the following: |
|
Thanks for the comments. The reason that both parameters are considered in one Pull Request was because both parameters are optionally applied in similar tests and files. |
|
well just to clarify, in FederatedPSControlThread computeWithBatchUpdate and computeWithNBatchUpdates your code always pushes the gradients right now (through gradient weighting) - for model averaging you would need to exchange the models instead (to allow for more advanced optimizers other than basic SGD). |
|
in computeWithBatchUpdate in the case (modelAvg = true ) we compute the gradients with the condition of localupdate is true (gradients = computeGradientsForNBatches(model, 1, localStartBatchNum, true);. and in this case gradients is an updated model positionally. So we push model in case modelAvg=true. |
|
well, I must be missing something - in the PR code, all federated variants call |
|
'So, the accrued gradients are in all federated variants potentially. I would like to talk about it tomorrow. |
we add a new Boolean parameter (ModelAvg) to add an optional feature to have a model averaging approach in parameter servers.