
Model averaging in parameter server#1336

Closed
atefeh-asayesh wants to merge 18 commits into apache:master from atefeh-asayesh:AverageModel_ParameterServer

Conversation

@atefeh-asayesh
Contributor

we add a new Boolean parameter (modelAvg) to optionally enable a model-averaging approach in the parameter server.

Contributor

@mboehm7 mboehm7 left a comment


Thanks for getting started on this optional model averaging feature in the paramserv builtin. Please fix the formatting issues, imports, and dependencies, and make sure the existing gradient accumulation and tests work without changes while adding the model averaging. In case of model averaging you can further avoid accruing the gradients.
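A minimal sketch of the last point (plain double[] arrays standing in for SystemDS matrix objects; the class and method names here are illustrative, not the actual SystemDS API): when model averaging is enabled, the worker can skip gradient accrual entirely, since it will push its updated model instead.

```java
public class GradAccrual {
    // Element-wise gradient accumulation, skipped under model averaging:
    // an averaging worker updates its local model instead of accruing gradients.
    public static double[] maybeAccrue(boolean modelAvg, double[] acc, double[] grad) {
        if (modelAvg)
            return acc; // no accrual needed, the model itself will be pushed
        double[] out = acc.clone();
        for (int i = 0; i < out.length; i++)
            out[i] += grad[i];
        return out;
    }
}
```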

public static final String PS_GRADIENTS = "gradients";
public static final String PS_SEED = "seed";
public static final String PS_MODELAVG = "modelAvg";
public static final String PS_MODELS = "models";
Contributor


why do we need this besides modelAvg - remove if unnecessary.

Contributor Author


we need this besides modelAvg because it is one of the inputs of the paramserv instruction.


// Push the gradients to ps
_ps.push(_workerID, gradients);
//_ps.push(_workerID, modell)
Contributor


what's this? please push either the gradients or the model, conditionally on the configuration.
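What the reviewer asks for might look like the following sketch (hypothetical names, plain double[] instead of SystemDS matrix objects): a local SGD step plus a single push whose payload depends on the modelAvg flag.

```java
public class WorkerStep {
    // Plain SGD local update: out = model - eta * grad (non-destructive).
    public static double[] updateModel(double[] model, double[] grad, double eta) {
        double[] out = model.clone();
        for (int i = 0; i < out.length; i++)
            out[i] -= eta * grad[i];
        return out;
    }

    // Under model averaging the worker pushes its updated model;
    // otherwise it pushes the (accrued) gradients.
    public static double[] payload(boolean modelAvg, double[] model, double[] grad, double eta) {
        return modelAvg ? updateModel(model, grad, eta) : grad;
    }
}
```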

}
}
}
//**************************************** ATEFEH *********************************************************************
Contributor


we do not use author tags - so please remove such comments with your name.


public LocalPSWorker(int workerID, String updFunc, Statement.PSFrequency freq,
int epochs, long batchSize, ExecutionContext ec, ParamServer ps)
int epochs, long batchSize, ExecutionContext ec, ParamServer ps,boolean modelavg)
Contributor


please, avoid corrupting the existing formatting.

Comment on lines +67 to +72
if (_modelAvg){
computeBatch_Avg(dataSize,batchIter);
}

else
computeBatch(dataSize, batchIter);
Contributor


the formatting seems off again.

matrix[double] X_val, matrix[double] y_val, int num_workers, int epochs,
string utype, string freq, int batch_size, string scheme, string runtime_balancing,
string weighting, double eta, int C, int Hin, int Win, int seed = -1)
string weighting, double eta, int C, int Hin, int Win, int seed = -1,boolean modelAvg)
Contributor


see above.

string weighting, double eta, int C, int Hin, int Win, int seed = -1,boolean modelAvg)
return (list[unknown] model)
{

Contributor


see above.

matrix[double] X_val, matrix[double] y_val,
int epochs, int batch_size, double eta,
int seed = -1)
int seed = -1 , boolean modelAvg )
Contributor


formatting

matrix[double] X_val, matrix[double] y_val,
int num_workers, int epochs, string utype, string freq, int batch_size, string scheme, string runtime_balancing, string weighting,
double eta, int seed = -1)
double eta, int seed = -1,boolean modelAvg)
Contributor


formatting.

val="./src/test/scripts/functions/federated/paramserv/TwoNN.dml::validate",
k=num_workers, utype=utype, freq=freq, epochs=epochs, batchsize=batch_size,
scheme=scheme, runtime_balancing=runtime_balancing, weighting=weighting, hyperparams=hyperparams, seed=seed)
scheme=scheme, runtime_balancing=runtime_balancing, weighting=weighting, hyperparams=hyperparams, seed=seed,modelAvg=modelAvg)
Contributor


formatting.

@mboehm7
Contributor

mboehm7 commented Jul 24, 2021

@atefeh-asayesh for the future, please don't reply to every one of these review comments (especially if they're repetitive); just add the commits with the fixes and let me know once you're ready with all of them, and I'll merge it in after another round of review.

…een fixed. Moreover, one new parameter has been added to the Paramserver instruction; this parameter is used when "freq=NBATCHES". We implement the computnbatch function for both the federated and local paramserver. Also, some points should be considered in some classes:

 1. ParamservRecompilationTest: in this class the test dml file is "paramserv-large-parallelism", in which the parameter "freq" is not the main parameter. The "nBatches" parameter is considered as 1.

 2. ParamservRuntimeNegativeTest: in this class the test dml file is "paramserv-large-parallelism", in which the parameter "freq" is not the main parameter. The "nBatches" parameter is considered as 1.

In LocalPSWorker, for the two frequency types BATCH and EPOCH, we check the modelAvg parameter; if modelAvg is true, we apply the model averaging method.
@atefeh-asayesh
Contributor Author

atefeh-asayesh commented Jul 24, 2021

Thank you very much for your constructive and useful comments. The formatting issue has been completely resolved. There will be no formatting issues in future requests. Comments about unnecessary imports have also been checked and resolved.
Also, a new parameter called nBatches has been added to paramserver. Based on the changes related to this new parameter, the related files have also been changed.
The modelAvg parameter has also been added to the local parameter server.

Bests
Atefeh

@mboehm7
Contributor

mboehm7 commented Jul 25, 2021

Unfortunately, there are too many problems with this PR (I spent two hours fixing the remaining issues, but there is no end in sight). Please split this PR into one new PR only for model averaging (and later, once the model averaging is really working and cleaned up, one PR for nbatches, because otherwise all fixes have to be applied in even more locations). In this process, please revert all changes to existing tests and all formatting changes, only add targeted new tests for your new features, and try to limit code changes to what is really needed for integrating model averaging as an optional feature.

The best path forward is to keep this PR open, cherry-pick code changes into new PRs, and fix the issues there. Note that Statement.PS_MODELS is unnecessary and should be removed, the gradient accumulation should only be done if no model averaging is used, the division of model averaging did not take effect (because it was not applied in place and the new blocks got discarded in the lambda function), computeBatch always pushed the gradients, and the federated control thread (per worker) always pushed the gradients. Also try to keep the core code path as simple as possible: e.g., you can do something like the following:

if (localUpdate | _modelAvg )
   params = updateModel(params, gradients, i, j, batchIter);
...				
// Push the gradients or model to ps
pushGradients(_modelAvg ? params : accGradients.get());
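On the "division did not take effect" point above, a hedged sketch of in-place averaging (plain double[] in place of SystemDS matrix blocks; this is illustrative, not the actual PR code): sum into one buffer and scale that same buffer, so the division is not lost on a discarded copy.

```java
import java.util.List;

public class ModelAveraging {
    // Element-wise average of equally-sized worker models, computed in place
    // in the first model's buffer so the final scaling actually takes effect
    // (new blocks created inside a lambda and then discarded would not).
    public static double[] averageInPlace(List<double[]> models) {
        double[] acc = models.get(0);
        for (int w = 1; w < models.size(); w++) {
            double[] m = models.get(w);
            for (int i = 0; i < acc.length; i++)
                acc[i] += m[i];
        }
        double inv = 1.0 / models.size();
        for (int i = 0; i < acc.length; i++)
            acc[i] *= inv; // scale the same buffer, not a copy
        return acc;
    }
}
```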

@atefeh-asayesh
Contributor Author

Thanks for the comments. The reason both parameters were handled in one pull request is that both are optionally applied in similar tests and files.
A separate test function is also added to check nBatches in some test functions.
Regarding the gradient accumulation, I would like to discuss it in the meeting.
The removal of PS_MODELS will be done.
computeBatch always pushes the gradients, but computeBatch_Avg pushes the models.
In the federated control thread (per worker), in both computeWithBatchUpdates and computeWithEpochUpdates, the modelAvg condition is considered.
Moreover, I will try to simplify the core code as much as possible.

@mboehm7
Contributor

mboehm7 commented Jul 25, 2021

well, just to clarify: in FederatedPSControlThread's computeWithBatchUpdate and computeWithNBatchUpdates, your code always pushes the gradients right now (through gradient weighting); for model averaging you would need to exchange the models instead (to allow for more advanced optimizers other than basic SGD).

@atefeh-asayesh
Contributor Author

In computeWithBatchUpdate, in the case modelAvg = true, we compute the gradients under the condition that localUpdate is true (gradients = computeGradientsForNBatches(model, 1, localStartBatchNum, true);), and in this case the gradients variable effectively holds the updated model. So we push the model in the case modelAvg=true.
For computeWithNBatchUpdates, modelAvg is not considered yet; I will extend it in the next PR.

@mboehm7
Contributor

mboehm7 commented Jul 25, 2021

well, I must be missing something - in the PR code, all federated variants call computeGradientsForNBatches, which in turn runs the federated UDF federatedComputeGradientsForNBatches, and this UDF in turn always returns new FederatedResponse(FederatedResponse.ResponseType.SUCCESS, new Object[]{accGradients, gradientsTime}); (with the accrued gradients), no matter how it's configured. We can talk about it offline tomorrow if you like.

@atefeh-asayesh
Contributor Author

So, the accrued gradients are potentially pushed in all federated variants. I would like to talk about it tomorrow.
Thanks.

@asfgit asfgit closed this in 82536c1 Sep 18, 2021