Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Messaging update #6115

Merged
merged 139 commits into from Sep 4, 2018

Conversation

@raver119
Copy link
Contributor

commented Aug 9, 2018

WIP; DO NOT MERGE;

This PR adds new messaging system for distributed training

shugeo and others added 17 commits Aug 8, 2018
Shugeo dynamic bp (#6106)
* Added dynamic_partition_bp op and helpers. Initial revision.

* Implementation of dynamic_partion_bp and test.

* Fixed implementation bugs.

* Minor fixes with dynamic_partition_bp and test.

* Modified dynamic_partition_bp op and tests.

* Modified dynamic_stitch invocation with dynamic_partition_bp op.

* Another approach to implement.

* Final version of dynamic_partition_bp op.

* Refactored revision of dynamic_partition_bp op.

* The first working revision.
@AlexDBlack
Copy link
Contributor

left a comment

Couple of minor things, otherwise LGTM so far 👍

originId = chunk.getOriginalId();
numChunks = chunk.getNumberOfChunks();
try {
holder = File.createTempFile("FileChunksTracker", "Message");

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Aug 13, 2018

Contributor

Let's make dir configurable. User might want it on SSD different to system drive.

This comment has been minimized.

Copy link
@raver119

raver119 Aug 13, 2018

Author Contributor

Yes, once we'll be ready to merge lots of stuff will become configurable.

raver119 added 12 commits Aug 13, 2018
Merge branch 'master' into r119_message_that
# Conflicts:
#	libnd4j/tests_cpu/layers_tests/DeclarableOpsTests9.cpp
raver119 added 21 commits Aug 29, 2018

@raver119 raver119 changed the title [WIP] Messaging update Messaging update Sep 3, 2018

@AlexDBlack
Copy link
Contributor

left a comment

Overall looks good, though I noted a bunch o small stuff (polishing, style related, lock usage etc).
Also: Change Spark version back to 1 (from 2)?
This PR does change scala default to 2.11 from 2.10 (but I'm OK with that)

function.step(params, updates);
updatesApplied.get().addAndGet(cnt);
log.info("Total updates applied so far for thread [{}]: [{}]", Thread.currentThread().getName(), updatesApplied.get());

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Debug or trace level?

@@ -128,6 +129,10 @@ public void initialize(@NonNull GradientsAccumulator accumulator) {
}

public INDArray encodeUpdates(INDArray updates) {
// getting statistics
log.info("Residual: {amean: {}; amax: {}; 50%: {}; 95%: {}; 99%: {}; 99.9%: {}}; Current Threshold: [{}]", updates.ameanNumber().doubleValue(), updates.amaxNumber().doubleValue(), Transforms.abs(updates, true).percentileNumber(50).doubleValue(), Transforms.abs(updates, true).percentileNumber(90).doubleValue(), Transforms.abs(updates, true).percentileNumber(99).doubleValue(), Transforms.abs(updates, true).percentileNumber(99.9).doubleValue(), currentThreshold.get());

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Remove, or move to a debug mode, or something. Don't want this calculated + logged by default.

if (modelParamsSupplier != null) {
val params = modelParamsSupplier.get();
if (params != null) {
// TODO: We should propagate params across the workers

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Unimplemented? Get the params then don't do anything with them...
If planned to be added later, maybe add UnsupportedOperationException?
(Better to get exception than obscure issues or silent failure)

@@ -93,7 +94,8 @@
protected int threadId;
protected Model originalModel;

protected final Object PARAMS_SYNC_LOCK_OBJECT = new Object();
//protected final Object PARAMS_SYNC_LOCK_OBJECT = new Object();

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Cleanup commented line

@@ -175,6 +180,8 @@ public void updateModel(@NonNull Model model) {
}

Nd4j.getExecutioner().commit();

modelLock.writeLock().unlock();

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Unlock should be in finally block?
Otherwise if there's any sort of an exception, any other threads waiting on that lock could be waiting forever.

public class BaseRequestConsumer<T extends RequestMessage> implements Consumer<T> {

@Override
public void accept(T t) throws Exception {

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Maybe class should be abstract with abstract accept method, not empty accept method?
Seems like a possible way to introduce bugs, in case any aren't overridden...

} catch (InterruptedException e) {
break;
} catch (IOException e) {
e.printStackTrace();

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Log not print?

} catch (InterruptedException e) {
break;
} catch (Exception e) {
e.printStackTrace();

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Log not print?

propagateMessageDirect(new MeshUpdateMessage(mesh.get()));
} catch (Exception e) {
log.error("Wasn't able to propagate message from [{}]", id());
e.printStackTrace();

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Put exception as log.error arg

if (maxBytes <= 0)
throw new ND4JIllegalStateException("MaxBytes must be > 0");

val tempFile = File.createTempFile("messageSplitter","temp");

This comment has been minimized.

Copy link
@AlexDBlack

AlexDBlack Sep 3, 2018

Contributor

Also configurable location here?

raver119 added 3 commits Sep 4, 2018

@raver119 raver119 merged commit 0728a73 into master Sep 4, 2018

0 of 3 checks passed

codeclimate Code Climate encountered an error attempting to analyze this pull request.
Details
continuous-integration/jenkins/branch This commit is being built
Details
continuous-integration/jenkins/pr-head This commit is being built
Details

@raver119 raver119 deleted the r119_message_that branch Sep 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.