
[SYSTEMDS-???] Parameterserver aggregation optimization #1211

Closed
Baunsgaard wants to merge 2 commits into apache:master from Baunsgaard:Conv2d-2

Conversation

@Baunsgaard
Copy link
Contributor

This PR contains two small commits.

  1. Remove an incorrect warning of a native error printed in the MKL call.
  2. AggregateFinalResults: avoid the generic access path by forcing dense format if the formats of the two parts differ.

The second commit brings the execution of the CNN ParameterServer with batch size 32 on MNIST from 260 sec down to 200 sec on my laptop, with minimal changes (10 lines). I don't think this has any bad side effects, but I would like confirmation.
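To illustrate the second commit, here is a minimal, hypothetical sketch of forcing a uniform (dense) format before aggregating two partial results. The `Block` class below is a stand-in for SystemDS's `MatrixBlock`, not the real API; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class UniformFormatAgg {

    /** Minimal stand-in for a matrix block that is either dense or sparse. */
    static class Block {
        int rows, cols;
        double[][] dense;         // non-null when in dense format
        List<double[]> triples;   // {row, col, value} entries when sparse

        static Block dense(double[][] v) {
            Block b = new Block();
            b.rows = v.length; b.cols = v[0].length; b.dense = v;
            return b;
        }
        static Block sparse(int rows, int cols, List<double[]> t) {
            Block b = new Block();
            b.rows = rows; b.cols = cols; b.triples = t;
            return b;
        }
        boolean isSparse() { return dense == null; }

        /** Force a sparse block into dense format (the conversion the PR adds). */
        void toDense() {
            if (!isSparse()) return;
            double[][] d = new double[rows][cols];
            for (double[] t : triples)
                d[(int) t[0]][(int) t[1]] = t[2];
            dense = d; triples = null;
        }
    }

    /**
     * Aggregate (sum) two partial results into the first. If the formats
     * differ, force both dense so the fast dense-dense loop runs instead of
     * a generic format-dispatching path -- the essence of the 10-line change.
     */
    static Block aggregate(Block a, Block b) {
        if (a.isSparse() != b.isSparse()) { // mixed formats: unify to dense
            a.toDense();
            b.toDense();
        }
        if (!a.isSparse()) { // dense-dense kernel
            for (int r = 0; r < a.rows; r++)
                for (int c = 0; c < a.cols; c++)
                    a.dense[r][c] += b.dense[r][c];
            return a;
        }
        // sparse-sparse: append b's entries into a (no duplicate merging here)
        a.triples.addAll(b.triples);
        return a;
    }

    public static void main(String[] args) {
        Block acc = Block.dense(new double[][] {{1, 2}, {3, 4}});
        List<double[]> t = new ArrayList<>();
        t.add(new double[] {0, 0, 5}); // row 0, col 0, value 5
        Block part = Block.sparse(2, 2, t);
        Block out = aggregate(acc, part);
        System.out.println(out.dense[0][0]); // prints 6.0
    }
}
```

The point of the sketch is the format check at the top of `aggregate`: paying a one-time sparse-to-dense conversion keeps every subsequent addition on the tight dense-dense loop.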

@mboehm7
Copy link
Contributor

mboehm7 commented Mar 24, 2021

Thanks for following up on this. Instead of converting the vector back to dense (after it has been copied from dense to sparse), couldn't we directly do ret.copy(rtasks.get(0).get(), false); //for init before the call to aggregateFinalResults, to keep the accumulator dense?
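The suggestion above, initializing the accumulator as a dense copy of the first partial result so later in-place adds never leave the dense path, can be sketched as follows. This is a simplified, hypothetical illustration using flat arrays, not the SystemDS `ret.copy(...)` API itself.

```java
import java.util.Arrays;

public class DenseAccumulatorInit {

    /**
     * Sum partial results from workers, each a flat row-major array.
     * The accumulator is initialized as a *dense* copy of the first
     * partial result (analogous to ret.copy(rtasks.get(0).get(), false)),
     * so every subsequent in-place add stays on the dense code path.
     */
    static double[] aggregate(double[][] partials) {
        // init: dense copy of the first partial result (no sparse detour)
        double[] acc = Arrays.copyOf(partials[0], partials[0].length);
        for (int i = 1; i < partials.length; i++)
            for (int j = 0; j < acc.length; j++)
                acc[j] += partials[i][j];
        return acc;
    }

    public static void main(String[] args) {
        double[] out = aggregate(new double[][] {{1, 2}, {3, 4}, {5, 6}});
        System.out.println(Arrays.toString(out)); // prints [9.0, 12.0]
    }
}
```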

Aggregation of final results would use a generic aggregation if
the inputs were in different formats from each other.
This commit changes the behavior to force a uniform format.

Furthermore, allocating the initial result in dense format improves
performance slightly more, by about one second.

The change improves the performance of:
    uack+   59.179sec   ->  uack+  9.0sec
on the CNN implementation using the parameterserver on MNIST.
@Baunsgaard
Copy link
Contributor Author

Thanks for following up on this. Instead of converting the vector back to dense (after it has been copied from dense to sparse), couldn't we directly do ret.copy(rtasks.get(0).get(), false); //for init before the call to aggregateFinalResults, to keep the accumulator dense?

I just tried it; we still encounter the case where the new incoming values are sparse while our aggregate is dense.
The situation we want to avoid is this mix, because then we end up using the generic LibMatrixAgg.aggregateBinaryMatrixLastRowDenseGeneric() in this code path when we have correction enabled:

[screenshot: the generic aggregation code path in LibMatrixAgg]

But I do get that it makes more sense to force the initial accumulator to be dense, since it most likely will be dense in the end, so I added the copy-to-dense part you suggested. Overall this gives us a one-second-better execution time on batch size 32, 1 epoch.

