[SYSTEMDS-2550] Parameter Server Validation and FedParamServ Statistics #1154

tobiasrieger · 2021-01-13T21:42:27Z

This PR adds two features to the parameter server: validation after each epoch and statistics for the federated parameter server.

Validation

There is now a second createPS function in the ParamservBuiltinCPInstruction. If all the additional arguments are specified the parameter server is able to validate after each epoch. It will do so if LOG.info is enabled.
This feature is implemented for the federated parameter server ONLY, but is easily implemented for the other cases too.

Federated Parameter Server Statistics

The federated parameter server now uses the same statistics as the regular one, where possible. Also a number of new ones were introduced:

Aggregated Validation Time
Aggregated Fed Communication Time
Federated Data Partitioning Time
Aggregated Fed Batch Weighing Time
Fed Worker Computation Time (This includes gradient calculation, local updates and batch slicing. The granularity can be improved if needed.)
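As a rough illustration of how aggregated timings like the ones listed above can be collected across threads, here is a minimal sketch. `FedTimingStats` and its methods are hypothetical names invented for this example. They are not the SystemDS statistics classes, and the PR's actual measurement code may look quite different.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical accumulator for illustration; not the SystemDS Statistics class. */
final class FedTimingStats {
    // LongAdder allows many worker threads to add elapsed time without locking.
    private final LongAdder commNanos = new LongAdder();
    private final LongAdder computeNanos = new LongAdder();

    /** Runs the call and adds its wall-clock time to the communication total. */
    <T> T timeCommunication(Supplier<T> call) {
        return timed(call, commNanos);
    }

    /** Runs the call and adds its wall-clock time to the computation total. */
    <T> T timeComputation(Supplier<T> call) {
        return timed(call, computeNanos);
    }

    private static <T> T timed(Supplier<T> call, LongAdder sink) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            // Recorded even if the call throws, so failures do not hide time spent.
            sink.add(System.nanoTime() - start);
        }
    }

    double communicationSeconds() {
        return commNanos.sum() / 1e9;
    }

    double computationSeconds() {
        return computeNanos.sum() / 1e9;
    }
}
```

Each timed call contributes to a running sum, which matches the "aggregated" wording of the statistics names; a finer granularity would simply mean more accumulators.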

Other Changes

scaleAndPushGradients was refactored to weighAndPushGradients
Logging cleanup
Not implemented exceptions
Validation Functions for the Federated Parameter Server Test DML files

Baunsgaard · 2021-01-15T08:43:47Z

There is now a second createPS function in the ParamservBuiltinCPInstruction. If all the additional arguments are specified the parameter server is able to validate after each epoch. It will do so if LOG.info is enabled.
This feature is implemented for the federated parameter server ONLY, but is easily implemented for the other cases too.

I can understand why this is done with the logging, but I think this is a slight misuse of logging in the system. Ideally we (or at least I) want no difference in execution, or as little as possible, when we set different logging levels. This change violates that, since the validation in the parameter server can take many seconds of execution time.

I would suggest adding a separate parameter to enable or disable the validation. We could still output the validation scores through LOG.info when that setting, which is unrelated to the logging level, is enabled.

But maybe I'm the only one with this opinion? @mboehm7 ?

Furthermore, design-wise, you might want to consider doing this validation in parallel with the distribution of model parameters. Since we know that the controller is doing nothing while the workers are training, you could do the validation on a separate thread at no cost to execution time.
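A minimal sketch of the separate-thread idea, under stated assumptions: `AsyncValidator` is a hypothetical class made up for this example, the validation function is reduced to "model in, score out", and the caller is assumed to pass a snapshot of the model that training will not mutate while validation runs. It is not taken from the PR.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper that validates a model snapshot off the controller thread. */
final class AsyncValidator<M> implements AutoCloseable {
    // One background thread is enough: validations are submitted once per epoch.
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final ToDoubleFunction<M> validate;

    AsyncValidator(ToDoubleFunction<M> validate) {
        this.validate = validate;
    }

    /** Assumption: modelSnapshot is a copy that is not modified after submission. */
    Future<Double> submit(M modelSnapshot) {
        Callable<Double> task = () -> validate.applyAsDouble(modelSnapshot);
        return pool.submit(task);
    }

    /** Blocks until the score is available and unwraps checked exceptions. */
    static double join(Future<Double> score) {
        try {
            return score.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for validation", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("validation failed", e.getCause());
        }
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

The controller would call `submit` right after an epoch's model is complete, continue distributing parameters, and call `join` only when it needs the score, for example just before reporting it.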

Baunsgaard

Two logic errors I spotted.

Baunsgaard · 2021-01-15T08:46:44Z

src/main/java/org/apache/sysds/runtime/controlprogram/paramserv/ParamServer.java

+
 	protected synchronized void updateGlobalModel(int workerID, ListObject gradients) {
 		try {
 			if (LOG.isDebugEnabled()) {


Everything is surrounded by an isDebugEnabled check, which is a lower level than info.

Baunsgaard · 2021-01-15T08:47:06Z

src/main/java/org/apache/sysds/runtime/controlprogram/paramserv/ParamServer.java

+					if(LOG.isInfoEnabled()) {
+						// This if works similarly to the one for BSP, but divides the sync couter through the number of workers,
+						// creating "Pseudo Epochs"
+						if (LOG.isInfoEnabled() && _validationPossible &&


The outer if checks the same condition as the first part of the inner if.
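For illustration, the pseudo-epoch condition can be written so that each part is tested exactly once. This is only a sketch under assumptions: it assumes the sync counter counts gradient pushes from all workers and that a pseudo epoch ends after `numWorkers * batchesPerEpoch` pushes. `PseudoEpochs`, `completesPseudoEpoch`, `shouldValidate`, and `batchesPerEpoch` are hypothetical names, not identifiers from ParamServer.java.

```java
/** Hypothetical helper; the names and the exact epoch rule are assumptions. */
final class PseudoEpochs {
    private PseudoEpochs() {}

    /**
     * Assumes syncCounter counts pushes from all workers, so one pseudo epoch
     * corresponds to numWorkers * batchesPerEpoch pushes.
     */
    static boolean completesPseudoEpoch(long syncCounter, int numWorkers, int batchesPerEpoch) {
        long pushesPerEpoch = (long) numWorkers * batchesPerEpoch;
        return syncCounter > 0 && syncCounter % pushesPerEpoch == 0;
    }

    /** Single combined check: no outer guard repeating part of the inner one. */
    static boolean shouldValidate(boolean validationPossible, long syncCounter,
                                  int numWorkers, int batchesPerEpoch) {
        return validationPossible
            && completesPseudoEpoch(syncCounter, numWorkers, batchesPerEpoch);
    }
}
```

With 3 workers and 4 batches per epoch, this rule would trigger validation on pushes 12, 24, 36, and so on.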

mboehm7 · 2021-01-15T10:27:27Z

Yes, I agree with @Baunsgaard regarding logging, as it might hide bugs if additional state is updated. Just guard the actual info logging (and string concatenation). Regarding parallelization, please put a TODO in there and focus on the main parts of the validation first.
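A minimal sketch of what "guard only the logging" could look like. Everything here is an assumption for illustration: `EpochValidation`, `validationEnabled`, and the `DoubleSupplier` are hypothetical, and `java.util.logging` stands in for the logger used in SystemDS so that the example is self-contained.

```java
import java.util.function.DoubleSupplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical example; java.util.logging stands in for the project's logger. */
final class EpochValidation {
    private static final Logger LOG = Logger.getLogger(EpochValidation.class.getName());

    private final boolean validationEnabled; // explicit setting, independent of log level
    private final DoubleSupplier validate;   // stand-in for the validation function
    private int runs = 0;
    private double lastLoss = Double.NaN;

    EpochValidation(boolean validationEnabled, DoubleSupplier validate) {
        this.validationEnabled = validationEnabled;
        this.validate = validate;
    }

    void afterEpoch(int epoch) {
        if (!validationEnabled)
            return;
        // State is updated regardless of the logging level.
        lastLoss = validate.getAsDouble();
        runs++;
        // Only the message construction and output are guarded.
        if (LOG.isLoggable(Level.INFO))
            LOG.info("Epoch " + epoch + " validation loss: " + lastLoss);
    }

    int runs() {
        return runs;
    }

    double lastLoss() {
        return lastLoss;
    }
}
```

With this shape, changing the log level changes what is printed but not what is executed, which is the property asked for in the review.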

tobiasrieger · 2021-01-29T08:25:03Z

Thank you @Baunsgaard and @mboehm7! I have addressed the issues you raised.
Changes:
Fixed Logic Errors
Guarded only the actual logging and changed the if structure a bit
Shortened Statistics output

mboehm7 · 2021-01-30T22:12:21Z

LGTM - thanks for the extension @tobiasrieger. I only made minor changes during the merge: (1) reverted the cumulative time measurement in Timing and changed the setup measurement accordingly, (2) added the data partitioning timing (time was measured but not collected), (3) reverted the changed test logging level to INFO, and (4) some minor formatting changes to avoid huge indentation.

Baunsgaard reviewed Jan 15, 2021

Tobias Rieger added 10 commits January 29, 2021 09:15

[SYSTEMDS-2550] added validation function to parameter server parameters

69a9f70

[SYSTEMDS-2550] implemented validation for BSP

2140721

[SYSTEMDS-2550] implemented validation for ASP with pseudo epochs

19d3db7

[SYSTEMDS-2550] fixed validation for CNN and improved testcases

5705c50

[SYSTEMDS-2550] Added diverse statistics

8ca326a

[SYSTEMDS-2550] Added communication time statistic

49f8aa1

[SYSTEMDS-2550] Cleaning for pull request

ab91aa5

[SYSTEMDS-2550] Removed a PUT_VAR federated Request to cut down on co…

639c508

…mmunciation overhead

[SYSTEMDS-2550] Added a statistic for total parameterserver run time

8ccaf7d

[SYSTEMDS-2550] Added changes suggested by Sebastian and Matthias

f5044d8

asfgit closed this in b6640d9 Jan 30, 2021
