New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Messaging update #6115

Merged
merged 139 commits into from Sep 4, 2018

Conversation

Projects
None yet
3 participants
@raver119
Copy link
Contributor

raver119 commented Aug 9, 2018

WIP; DO NOT MERGE;

This PR adds new messaging system for distributed training

shugeo and others added some commits Aug 8, 2018

Shugeo dynamic bp (#6106)
* Added dynamic_partition_bp op and helpers. Initial revision.

* Implementation of dynamic_partion_bp and test.

* Fixed implementation bugs.

* Minor fixes with dynamic_partition_bp and test.

* Modified dynamic_partition_bp op and tests.

* Modified dynamic_stitch invocation with dynamic_partition_bp op.

* Another approach to implement.

* Final version of dynamic_partition_bp op.

* Refactored revision of dynamic_partition_bp op.

* The first working revision.
@AlexDBlack
Copy link
Member

AlexDBlack left a comment

Couple of minor things, otherwise LGTM so far 👍

Show resolved Hide resolved nd4j/nd4j-common/src/main/java/org/nd4j/linalg/primitives/Atomic.java
originId = chunk.getOriginalId();
numChunks = chunk.getNumberOfChunks();
try {
holder = File.createTempFile("FileChunksTracker", "Message");

This comment has been minimized.

@AlexDBlack

AlexDBlack Aug 13, 2018

Member

Let's make dir configurable. User might want it on SSD different to system drive.

This comment has been minimized.

@raver119

raver119 Aug 13, 2018

Contributor

Yes, once we'll be ready to merge lots of stuff will become configurable.

raver119 added some commits Aug 13, 2018

Merge branch 'master' into r119_message_that
# Conflicts:
#	libnd4j/tests_cpu/layers_tests/DeclarableOpsTests9.cpp

raver119 added some commits Aug 29, 2018

@raver119 raver119 changed the title [WIP] Messaging update Messaging update Sep 3, 2018

@AlexDBlack
Copy link
Member

AlexDBlack left a comment

Overall looks good, though I noted a bunch o small stuff (polishing, style related, lock usage etc).
Also: Change Spark version back to 1 (from 2)?
This PR does change scala default to 2.11 from 2.10 (but I'm OK with that)

function.step(params, updates);
updatesApplied.get().addAndGet(cnt);
log.info("Total updates applied so far for thread [{}]: [{}]", Thread.currentThread().getName(), updatesApplied.get());

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Debug or trace level?

@@ -128,6 +129,10 @@ public void initialize(@NonNull GradientsAccumulator accumulator) {
}

public INDArray encodeUpdates(INDArray updates) {
// getting statistics
log.info("Residual: {amean: {}; amax: {}; 50%: {}; 95%: {}; 99%: {}; 99.9%: {}}; Current Threshold: [{}]", updates.ameanNumber().doubleValue(), updates.amaxNumber().doubleValue(), Transforms.abs(updates, true).percentileNumber(50).doubleValue(), Transforms.abs(updates, true).percentileNumber(90).doubleValue(), Transforms.abs(updates, true).percentileNumber(99).doubleValue(), Transforms.abs(updates, true).percentileNumber(99.9).doubleValue(), currentThreshold.get());

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Remove, or move to a debug mode, or something. Don't want this calculated + logged by default.

if (modelParamsSupplier != null) {
val params = modelParamsSupplier.get();
if (params != null) {
// TODO: We should propagate params across the workers

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Unimplemented? Get the params then don't do anything with them...
If planned to be added later, maybe add UnsupportedOperationException?
(Better to get exception than obscure issues or silent failure)

@@ -93,7 +94,8 @@
protected int threadId;
protected Model originalModel;

protected final Object PARAMS_SYNC_LOCK_OBJECT = new Object();
//protected final Object PARAMS_SYNC_LOCK_OBJECT = new Object();

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Cleanup commented line

@@ -175,6 +180,8 @@ public void updateModel(@NonNull Model model) {
}

Nd4j.getExecutioner().commit();

modelLock.writeLock().unlock();

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Unlock should be in finally block?
Otherwise if there's any sort of an exception, any other threads waiting on that lock could be waiting forever.

public class BaseRequestConsumer<T extends RequestMessage> implements Consumer<T> {

@Override
public void accept(T t) throws Exception {

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Maybe class should be abstract with abstract accept method, not empty accept method?
Seems like a possible way to introduce bugs, in case any aren't overridden...

} catch (InterruptedException e) {
break;
} catch (IOException e) {
e.printStackTrace();

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Log not print?

} catch (InterruptedException e) {
break;
} catch (Exception e) {
e.printStackTrace();

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Log not print?

propagateMessageDirect(new MeshUpdateMessage(mesh.get()));
} catch (Exception e) {
log.error("Wasn't able to propagate message from [{}]", id());
e.printStackTrace();

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Put exception as log.error arg

if (maxBytes <= 0)
throw new ND4JIllegalStateException("MaxBytes must be > 0");

val tempFile = File.createTempFile("messageSplitter","temp");

This comment has been minimized.

@AlexDBlack

AlexDBlack Sep 3, 2018

Member

Also configurable location here?

raver119 added some commits Sep 4, 2018

@raver119 raver119 merged commit 0728a73 into master Sep 4, 2018

0 of 3 checks passed

codeclimate Code Climate encountered an error attempting to analyze this pull request.
Details
continuous-integration/jenkins/branch This commit is being built
Details
continuous-integration/jenkins/pr-head This commit is being built
Details

@raver119 raver119 deleted the r119_message_that branch Sep 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment