Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMML-2085] Initial version of local execution backend #771

Closed
wants to merge 2 commits into from

Conversation

EdgarLGB
Copy link
Member

Hi @mboehm7 ,

I'd like to make a new PR to the be more clear. In fact, concerning the error that occured in the two new added test, I'm not sure that what had happened because they passed on my side. If they do not pass on your side, I'd like to know the exact error message.
And also for the way to handle the multi-thread, I'm not sure if using the old-school way (using the object lock instead of a reentrant lock) is a good idea here.

Thanks for the review,
Guobao

@akchinSTC
Copy link
Contributor

Can an Admin verify this patch?

@mboehm7
Copy link
Contributor

mboehm7 commented May 29, 2018

Well, it was still failing in my win environment with the following exception:

Caused by: org.apache.sysml.runtime.DMLRuntimeException: namespace ./src/test/scripts/functions/paramserv/mnist_lenet_paramserv_minimum_version.dml is undefined
	at org.apache.sysml.runtime.controlprogram.Program.getFunctionProgramBlock(Program.java:93)
	at org.apache.sysml.runtime.controlprogram.paramserv.ParamServer$AggregationService.<init>(ParamServer.java:137)
	at org.apache.sysml.runtime.controlprogram.paramserv.ParamServer.<init>(ParamServer.java:65)
	at org.apache.sysml.runtime.controlprogram.paramserv.LocalParamServer.<init>(LocalParamServer.java:33)
	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.createPS(ParamservBuiltinCPInstruction.java:200)
	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:103)
	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
	... 35 more

After debugging this a bit, the issue was OS-specific because the namespace is appended to the working directory with File.separator which gives ".\src/..." on windows. However, since this is unnecessary, we should always consistently use / instead. With this change, the tests are running fine. I'll make another pass over this PR soon.

@mboehm7
Copy link
Contributor

mboehm7 commented May 29, 2018

LGTM - thanks @EdgarLGB for the initial version and incorporating the earlier feedback. I think this is a good starting point for additional improvements. During merge, I will cleanup a couple of minor issues (warnings, array handling, matrix slicing), which are mostly just syntactic.

Looking forward there are a couple of things we should address next (before moving to distributed backends):

  1. Locking and Synchronization: I agree, the current synchronization schemes doesn't seem very nice and extensible. After thinking about it, maybe we could use the following simple queuing model: Let every worker have a input queue and all workers a shared output queue. By using blocking queues, the workers could simply dequeue the next model (and block until these are available without any other synchronization). In case of a synchronous PS, the PS would wait for all N worker outputs, aggregate and subsequently place the updated model in each workers queue. Similarly, for an asynchronous PS, the PS takes each worker output, aggregates it and puts the updated model immediately into the workers queue. This way, we get extensibility and do not have to worry about anomalies, like pull-after-push.

  2. Error handling: While playing around with the locking scheme of the parameter server, I encountered unrelated errors that led to the parameter server hanging. We need to make sure all worker errors are correctly propagated so that we can guarantee termination.

  3. Usage: To simplify the usage of parameter servers, it might be very useful to allow the passing of function pointers alternatively to function names and namespaces. For example, there is a test that uses a namespace that is defined in a different file which might lead to inconsistencies.

  4. Functions per Worker: If I did not miss it, we currently don't create deep copies of functions per worker. This is problematic in terms of potential conflicts of concurrent dynamic recompilation, potential file name conflicts, and, over-provisioning CPU resources.

@asfgit asfgit closed this in 97018d4 May 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants