Skip to content

Conversation

@tobiasrieger
Copy link
Contributor

This PR contains a lot of valuable features for the parameter server

Validation

There is now a second createPS function in the ParamservBuiltinCPInstruction. If all the additional arguments are specified the parameter server is able to validate after each epoch. It will do so if LOG.info is enabled.
This feature is implemented for the federated parameter server ONLY, but is easily implemented for the other cases too.

Federated Parameter Server Statistics

The federated parameter server now uses the same statistics as the regular one, where possible. Also a number of new ones were introduced:

  • Aggregated Validation Time
  • Aggregated Fed Communication Time
  • Federated Data Partitioning Time
  • Aggregated Fed Batch Weighing Time
  • Fed Worker Computation Time (This includes gradient calculation, local updates and batch slicing. The granularity can be improved if needed.)

Other Changes

  • scaleAndPushGradients was refactored to weighAndPushGradients
  • Logging cleanup
  • Not implemented exceptions
  • Validation Functions for the Federated Parameter Server Test DML files

@Baunsgaard
Copy link
Contributor

There is now a second createPS function in the ParamservBuiltinCPInstruction. If all the additional arguments are specified the parameter server is able to validate after each epoch. It will do so if LOG.info is enabled.
This feature is implemented for the federated parameter server ONLY, but is easily implemented for the other cases too.

I can understand why this is done with the logging. But i think this is a slight misuse of the logging in the system, since ideally we (or at least me) want no difference in execution (or as close to) when we set different logging levels. This violates it since the validation in the parameter server can take many seconds of execution time.

I would suggest adding a parameter some other way to enable / disable the validation. While it is still possible that we output the validation scores through LOG.info when that other "unrelated to logging level" setting is set.

But maybe I'm the only one with this opinion? @mboehm7 ?

furthermore designwise, you might want to consider doing this validation in parallel with the distribution of model parameters. since we know that the controller is doing nothing while the workers are training, you could do the validation on a separate thread at no cost to execution time.

Copy link
Contributor

@Baunsgaard Baunsgaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two logic errors i spotted.


protected synchronized void updateGlobalModel(int workerID, ListObject gradients) {
try {
if (LOG.isDebugEnabled()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything is surrounded by a is debug enabled witch is a lower level than info.

if(LOG.isInfoEnabled()) {
// This if works similarly to the one for BSP, but divides the sync couter through the number of workers,
// creating "Pseudo Epochs"
if (LOG.isInfoEnabled() && _validationPossible &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outerif checks the same as first part of inner if.

@mboehm7
Copy link
Contributor

mboehm7 commented Jan 15, 2021

yes, I agree with @Baunsgaard regarding logging as it might hide bugs if there is additional state updated. Just guard the actual info logging (and string concatenation). Regarding parallelization please put a TODO in there and focus on the main parts of the validation first.

@tobiasrieger
Copy link
Contributor Author

Thank you @Baunsgaard and @mboehm7! I have addressed the issues you raised.
Changes:
Fixed Logic Errors
Only guarded the actually logging and changed the if structure a bit
Shortened Statistics output

@mboehm7
Copy link
Contributor

mboehm7 commented Jan 30, 2021

LGTM - thanks for the extension @tobiasrieger. I only made minor changes during the merge: (1) reverted the cumulative time measurement in Timing and changed the setup measurement accordingly, (2) added the data partitioning timing (time was measured but not collected), (3) reverted the changed test logging level to INFO, and (4) some minor formatting changes to avoid huge indentation.

@asfgit asfgit closed this in b6640d9 Jan 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants