Model averaging in parameter server by atefeh-asayesh · Pull Request #1336 · apache/systemds

atefeh-asayesh · 2021-07-04T00:34:57Z

We add a new Boolean parameter (ModelAvg) that optionally enables a model averaging approach in parameter servers.


mboehm7

Thanks for getting started on this optional model averaging feature in the paramserv builtin. Please fix the formatting issues, imports, and dependencies, and make sure the existing gradient accumulation and tests work without changes while adding the model averaging. In the case of model averaging, you can additionally avoid accruing the gradients.

mboehm7 · 2021-07-19T11:14:49Z

src/main/java/org/apache/sysds/parser/Statement.java

 	public static final String PS_GRADIENTS = "gradients";
 	public static final String PS_SEED = "seed";
+	public static final String PS_MODELAVG = "modelAvg";
+	public static final String PS_MODELS = "models";


why do we need this besides modelAvg - remove if unnecessary.

We need this besides modelAvg because it is one of the inputs of the paramserver instruction.

mboehm7 · 2021-07-19T11:15:25Z

src/main/java/org/apache/sysds/runtime/controlprogram/paramserv/FederatedPSControlThread.java


 		// Push the gradients to ps
 		_ps.push(_workerID, gradients);
+		//_ps.push(_workerID, modell)


What's this? Please push either the gradients or the model, depending on the configuration.
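A minimal sketch of what such a configuration-dependent push could look like. The `PushTarget` interface and the `pushUpdate` helper are hypothetical stand-ins for illustration only; they are not the SystemDS `ParamServer` API, and the payload is simplified to a `double[]`.

```java
// Hypothetical stand-in for the parameter server (assumption, not SystemDS API).
interface PushTarget {
    void push(int workerID, double[] payload);
}

final class ConditionalPushSketch {
    // Pushes the locally updated model when model averaging is enabled,
    // otherwise pushes the computed gradients.
    static void pushUpdate(PushTarget ps, int workerID, boolean modelAvg,
        double[] model, double[] gradients) {
        ps.push(workerID, modelAvg ? model : gradients);
    }
}
```

Keeping the decision in a single place avoids a second, commented-out push call next to the existing one.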

mboehm7 · 2021-07-19T11:15:49Z

src/main/java/org/apache/sysds/runtime/controlprogram/paramserv/FederatedPSControlThread.java

 			}
 		}
 	}
+	//****************************************  ATEFEH *********************************************************************


we do not use author tags - so please remove such comments with your name.

mboehm7 · 2021-07-19T11:16:28Z

src/main/java/org/apache/sysds/runtime/controlprogram/paramserv/LocalPSWorker.java


 	public LocalPSWorker(int workerID, String updFunc, Statement.PSFrequency freq,
-		int epochs, long batchSize, ExecutionContext ec, ParamServer ps)
+						 int epochs, long batchSize, ExecutionContext ec, ParamServer ps,boolean modelavg)


please, avoid corrupting the existing formatting.

mboehm7 · 2021-07-19T11:16:51Z

src/main/java/org/apache/sysds/runtime/controlprogram/paramserv/LocalPSWorker.java

+					if (_modelAvg){
+						computeBatch_Avg(dataSize,batchIter);
+					}
+
+					else
+						computeBatch(dataSize, batchIter);


the formatting seems off again.

mboehm7 · 2021-07-19T11:32:42Z

src/test/scripts/functions/federated/paramserv/CNN.dml

  matrix[double] X_val, matrix[double] y_val, int num_workers, int epochs,
  string utype, string freq, int batch_size, string scheme, string runtime_balancing,
-  string weighting, double eta, int C, int Hin, int Win, int seed = -1)
+  string weighting, double eta, int C, int Hin, int Win, int seed = -1,boolean modelAvg)


mboehm7 · 2021-07-19T11:32:50Z

src/test/scripts/functions/federated/paramserv/CNN.dml

+  string weighting, double eta, int C, int Hin, int Win, int seed = -1,boolean modelAvg)
  return (list[unknown] model)
 {
+


mboehm7 · 2021-07-19T11:33:04Z

src/test/scripts/functions/federated/paramserv/TwoNN.dml

                 matrix[double] X_val, matrix[double] y_val,
                 int epochs, int batch_size, double eta,
-                 int seed = -1)
+                 int seed = -1 , boolean modelAvg )


mboehm7 · 2021-07-19T11:33:10Z

src/test/scripts/functions/federated/paramserv/TwoNN.dml

                 matrix[double] X_val, matrix[double] y_val,
                 int num_workers, int epochs, string utype, string freq, int batch_size, string scheme, string runtime_balancing, string weighting,
-                 double eta, int seed = -1)
+                 double eta, int seed = -1,boolean modelAvg)


formatting.

mboehm7 · 2021-07-19T11:33:24Z

src/test/scripts/functions/federated/paramserv/TwoNN.dml

    val="./src/test/scripts/functions/federated/paramserv/TwoNN.dml::validate",
    k=num_workers, utype=utype, freq=freq, epochs=epochs, batchsize=batch_size,
-    scheme=scheme, runtime_balancing=runtime_balancing, weighting=weighting, hyperparams=hyperparams, seed=seed)
+    scheme=scheme, runtime_balancing=runtime_balancing, weighting=weighting, hyperparams=hyperparams, seed=seed,modelAvg=modelAvg)


formatting.

mboehm7 · 2021-07-24T20:48:41Z

@atefeh-asayesh for the future, please don't comment on every one of these comments (especially if they're repetitive) - just add the commits with the fixes and let me know once you're ready with all of them and I merge it in after another round of review.

…een fixed. Moreover, one new parameter has been added to the Paramserver instruction. This parameter is used when "freq=NBATCHES". We implement the computnbatch function for both the federated and the local paramserver. Also, some points should be considered in some classes: 1- (ParamservRecompilationTest) In this class, the test dml file is "paramserv-large-parallelism", in which the parameter "freq" is not the main parameter; the "nBatches" parameter is considered as 1. 2- (ParamservRuntimeNegativeTest) In this class, the test dml file is "paramserv-large-parallelism", in which the parameter "freq" is not the main parameter; the "nBatches" parameter is considered as 1. In localPSWorker, for the two types BATCH and EPOCH, we check the modelAvg parameter. If modelAvg is true, then we apply the model averaging method.

atefeh-asayesh · 2021-07-24T23:52:33Z

Thank you very much for your constructive and useful comments. The formatting issue has been completely resolved. There will be no formatting issues in future requests. Comments about unnecessary imports have also been checked and resolved.
Also, a new parameter called nBatches has been added to paramserver. Based on the changes related to this new parameter, the related files have also been changed.
The modelAvg parameter has also been added to the local parameter server.

Bests
Atefeh

mboehm7 · 2021-07-25T19:09:50Z

Unfortunately, there are too many problems with this PR (I spent two hours fixing the remaining issues but there is no end in sight). Please split this PR into one new PR only for model averaging (and later, once the model averaging is really working and cleaned up, one PR for nbatches because otherwise all fixes have to be applied in even more locations). In this process, please revert all changes of existing tests and formatting changes, only add targeted new tests for your new features, and try to limit code changes to what is really needed for integrating model averaging as an optional feature.

The best path forward is to keep this PR open, cherry-pick code changes into new PRs, and fix the issues. Note that Statement.PS_MODELS is unnecessary and should be removed; the gradient accumulation should only be done if no model averaging is used; the division of model averaging did not take effect (because it was not applied in place and new blocks got discarded in the lambda function); computeBatch always pushed the gradients; and the federated control thread (per worker) always pushed the gradients. Also try to keep the core code path as simple as possible: e.g., you can do something like the following:

if (localUpdate || _modelAvg)
   params = updateModel(params, gradients, i, j, batchIter);
...
// Push the gradients or model to ps
pushGradients(_modelAvg ? params : accGradients.get());
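The remark about the division not taking effect can be illustrated with a small, self-contained sketch. This is not the PR's code: the class and method names are hypothetical, and the model is simplified to a `List<double[]>` of accumulated sums. It only shows the general pitfall of computing a new value inside a lambda without storing it.

```java
import java.util.Arrays;
import java.util.List;

final class AveragingPitfallSketch {
    // Buggy variant: the lambda builds a new array per entry, but nothing
    // stores it, so the accumulated sums in the list remain unchanged.
    static void divideDiscarded(List<double[]> sums, int numWorkers) {
        sums.forEach(block -> {
            double[] scaled = Arrays.stream(block).map(v -> v / numWorkers).toArray();
            // 'scaled' is dropped here; the list still holds the old 'block'
        });
    }

    // Fixed variant: replaceAll writes the lambda's result back into the list.
    static void divideStored(List<double[]> sums, int numWorkers) {
        sums.replaceAll(block -> Arrays.stream(block).map(v -> v / numWorkers).toArray());
    }
}
```

An alternative fix is to divide the existing array elements in place inside the loop; either way, the result has to end up in the structure that is read afterwards.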

atefeh-asayesh · 2021-07-25T21:51:51Z

Thanks for the comments. Both parameters were included in one pull request because both are optionally applied in similar tests and files.
A separate test function is also used to check nBatches in some test functions.
I would like to discuss the gradient accumulation in a meeting.
Removal of PS_MODELS will be done.
computeBatch always pushed the gradients, but computeBatch_Avg pushes the models.
In the federated control thread (per worker), the modelAvg condition is considered in both computeWithBatchUpdates and computeWithEpochUpdates.
Moreover, I will try to simplify the core code as much as possible.

mboehm7 · 2021-07-25T22:07:46Z

well just to clarify, in FederatedPSControlThread computeWithBatchUpdate and computeWithNBatchUpdates your code always pushes the gradients right now (through gradient weighting) - for model averaging you would need to exchange the models instead (to allow for more advanced optimizers other than basic SGD).
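To make the distinction concrete, here is a small sketch under simplifying assumptions (models and gradients as `double[]`, equal worker weights, one plain SGD step from the same starting model with the same learning rate). The class and method names are hypothetical. In this special case, averaging the locally updated models gives the same result as applying the averaged gradients; once the local update is not a fixed linear function of the gradient (for example, an optimizer that keeps per-worker state), the two generally differ, which is why the models themselves would need to be exchanged.

```java
final class ModelAveragingSketch {
    // Element-wise mean over the rows (one row per worker).
    static double[] average(double[][] rows) {
        double[] mean = new double[rows[0].length];
        for (double[] row : rows)
            for (int j = 0; j < mean.length; j++)
                mean[j] += row[j] / rows.length;
        return mean;
    }

    // One plain SGD step: w - eta * g (returns a new array).
    static double[] sgdStep(double[] w, double[] g, double eta) {
        double[] out = new double[w.length];
        for (int j = 0; j < w.length; j++)
            out[j] = w[j] - eta * g[j];
        return out;
    }
}
```

With a starting model `w`, averaging `sgdStep(w, g_i, eta)` over all workers matches `sgdStep(w, average(gradients), eta)` up to floating-point rounding.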

atefeh-asayesh · 2021-07-25T22:16:21Z

In computeWithBatchUpdate, in the case modelAvg = true, we compute the gradients with the condition that localUpdate is true (gradients = computeGradientsForNBatches(model, 1, localStartBatchNum, true);), and in this case gradients is an updated model positionally. So we push the model in case modelAvg=true.
For computeWithNBatchUpdates, "modelAvg" is not considered yet. In the next PR I will extend it.

mboehm7 · 2021-07-25T22:31:28Z

Well, I must be missing something: in the PR code, all federated variants call computeGradientsForNBatches, which in turn runs the federated UDF federatedComputeGradientsForNBatches, and this UDF in turn always returns new FederatedResponse(FederatedResponse.ResponseType.SUCCESS, new Object[]{accGradients, gradientsTime}); (with the accrued gradients), no matter how it's configured. We can talk about it offline tomorrow if you like.

atefeh-asayesh · 2021-07-25T23:13:50Z

So, the accrued gradients are in all federated variants potentially. I would like to talk about it tomorrow.
Thanks.

atefeh-asayesh added 8 commits July 4, 2021 01:24

we add a new boolean parameter (ModelAvg) to add an optional feature …

fb90764

…to have a model averaging approach in parameter servers.

we add a new boolean parameter (ModelAvg) to add an optional feature …

4f0ca72

…to have a model averaging approach in parameter servers.

we add a new boolean parameter (ModelAvg) to add an optional feature …

fac3a29

…to have a model averaging approach in parameter servers.

changing the FederatedParamservTest.dml file to solve the error.

f9e07e4

changing the FederatedParamservTest.dml file to solve the error.

536deca

changing the FederatedParamservTest.dml file to solve the error.

a870565

changing the FederatedParamservTest.dml file to solve the error.

77476a6

changing the FederatedParamservTest.java file and check to handle the…

8c85912

… error

mboehm7 reviewed Jul 19, 2021

atefeh-asayesh added 4 commits July 19, 2021 17:04

fixing formatting issues.

011189b

fixing formatting issues.

5d1e1d8

fixing formatting issues.

4ebf10f

fixing formatting issues.

465297d

atefeh-asayesh added 5 commits July 24, 2021 22:51

fixing formatting issues.

6b22324

fixing formatting issues.

a05bb06

fixing formatting issues.

f9221f6

fixing formatting issues.

c6d10a5

asfgit closed this in 82536c1 Sep 18, 2021
