Refactor AdaGrad optimizer to support sparse tensors + unary and binary refactoring for new infer storage type logic #7903
Conversation
@@ -36,21 +36,21 @@ NNVM_REGISTER_OP(_backward_add)
ElemwiseBinaryOp::BackwardUseNoneWithHalf2<gpu, mshadow_op::identity,
                                           mshadow_op::identity>);

NNVM_REGISTER_OP(_sub)
NNVM_REGISTER_OP(elemwise_sub)
Doesn't this break backward compatibility in the cpp package?
Not that I know of. I don't think anyone using cpp-package uses these sorts of operators.
The naming inconsistency between elemwise_sub and the other similar three is nonsensical.
The definition for sparse update is to only apply update on weight/states on the rows whose gradients are non-zeros.
We should revisit our approach and have sparse update primitives like:
`scatter_add`: adds `scalar` to all rows of `lhs` (row_sparse) specified by `idx`, which returns a row_sparse result.
`scatter_div`: divides `lhs` (row_sparse) by `rhs` (row_sparse) for the rows specified by `idx`, which returns the updated row_sparse result
so that Adagrad can be implemented as:
indices = grad.indices
history[:] = op.elemwise_add(history, op.square(grad))
srt = op.sqrt(nd.sparse.scatter_add(lhs=history, scalar=eps, idx=indices))
div = nd.sparse.scatter_div(lhs=grad, rhs=srt, idx=indices)
weight[:] += (div + sparse.retain(weight, indices) * wd) * -lr
Implementing these primitives takes extra time, but it is definitely very useful when it comes to supporting sparse updates for other optimizers implemented in python. @cjolivier01 This involves a large scope. What do you think?
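For concreteness, here is a dense NumPy stand-in for the proposed semantics (`scatter_add`, `scatter_div`, and `retain` below are hypothetical helpers that only illustrate the suggestion; they are not existing MXNet operators):

```python
import numpy as np

def scatter_add(lhs, scalar, idx):
    # Hypothetical primitive: add `scalar` only to the rows of `lhs` listed in `idx`.
    out = lhs.copy()
    out[idx] += scalar
    return out

def scatter_div(lhs, rhs, idx):
    # Hypothetical primitive: divide the rows of `lhs` listed in `idx` by the same rows of `rhs`.
    out = lhs.copy()
    out[idx] = lhs[idx] / rhs[idx]
    return out

def retain(arr, idx):
    # Dense stand-in for sparse.retain: keep only the rows listed in `idx`.
    out = np.zeros_like(arr)
    out[idx] = arr[idx]
    return out

lr, wd, eps = 0.1, 1e-4, 1e-7
weight  = np.random.rand(5, 3)
history = np.zeros((5, 3))
grad    = np.zeros((5, 3))
indices = np.array([1, 3])                        # rows that actually have gradients
grad[indices] = np.random.rand(len(indices), 3)

history += grad ** 2                                      # history = history + grad^2
srt = np.sqrt(scatter_add(history, eps, indices))         # sqrt(history + eps) on touched rows only
div = scatter_div(grad, srt, indices)                     # grad / sqrt(history + eps) on touched rows
weight += (div + retain(weight, indices) * wd) * -lr      # only rows in `indices` change
```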
cjolivier01: We discussed before that these scatter ops were internal. It's not clear what you're suggesting with that adagrad code. What are you saying would be different? Feel free to implement other operators if you like.
python/mxnet/optimizer.py
Outdated
def create_state(self, index, weight):
    return zeros(weight.shape, weight.context)  # history
    return zeros(weight.shape, weight.context, stype=self.stype)  # history
Please create the state based on weight.stype. The states should always have the same stype as the weight. Perform sparse update only when w.stype == g.stype == state.stype
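For reference, a minimal sketch of that suggestion (not the PR's final code; assumes the usual `import mxnet as mx` and the standard Optimizer API):

```python
import mxnet as mx

def create_state(self, index, weight):
    # The state (history) mirrors the weight's storage type, so the sparse update
    # path is only taken when weight, grad, and state stypes all agree.
    return mx.nd.zeros(weight.shape, weight.context, dtype=weight.dtype,
                       stype=weight.stype)  # history
```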
They don't, actually. I get update calls for size-1 dense weights along with the sparse ones. Is the update not expected to occur then? Because the non-sparse version updates them.
done
python/mxnet/optimizer.py
Outdated
@@ -665,26 +667,46 @@ class AdaGrad(Optimizer):
    eps: float, optional
        Small value to avoid division by 0.
    """
    def __init__(self, eps=1e-7, **kwargs):
    def __init__(self, eps=1e-7, stype='default', **kwargs):
Please remove stype argument here.
done
python/mxnet/optimizer.py
Outdated
@@ -665,26 +667,46 @@ class AdaGrad(Optimizer):
    eps: float, optional
        Small value to avoid division by 0.
    """
    def __init__(self, eps=1e-7, **kwargs):
Please update the documentation the same way as Adam/SGD
Can you link to the PR?
python/mxnet/optimizer.py
Outdated
weight[:] += -lr * (grad / sqrt(history + self.float_stable_eps) + wd * weight)
save_history_stype = history.stype

is_sparse = True if weight.stype != 'default' or grad.stype != 'default' else False
is_sparse = True iff w.stype == g.stype == state.stype == row_sparse
either
`x = True if cond else False` <=> `x = cond`
For Adam and SGD, it:
- performs dense updates if everything is default
- performs sparse updates if everything is row_sparse
- falls back to dense and prints a warning message if the inputs have both sparse and dense (see the sketch below)
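Roughly, the behavior described above looks like this (a sketch only; `_sparse_update` and `_dense_update` are hypothetical helper names, not the actual Adam/SGD code):

```python
import warnings

def _update_impl(self, index, weight, grad, state):
    stypes = {weight.stype, grad.stype, state.stype}
    if stypes == {'row_sparse'}:
        # Everything is row_sparse: take the sparse update path.
        self._sparse_update(index, weight, grad, state)
    else:
        if 'row_sparse' in stypes and 'default' in stypes:
            warnings.warn('Mixed sparse/dense inputs; falling back to the dense update.')
        # Everything default, or a mix of stypes: take the dense update path.
        self._dense_update(index, weight, grad, state)
```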
Many of these ops support both sparse and dense input combinations (and handle them in an efficient manner without fallback). Not to say it's the most efficient way to do it, but it's legal.
python/mxnet/optimizer.py
Outdated
if is_sparse:
    history[:] = op.elemwise_add(history, op.square(grad))
    assert history.stype == save_history_stype
    srt = op.sqrt(history)
Not adding eps will lead to numerical errors (NaN) since some entries in grad.data are zero.
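A tiny NumPy illustration of the failure mode (illustrative only, not MXNet code):

```python
import numpy as np

grad = np.array([0.5, 0.0])        # one stored row of the gradient; second entry is zero
history = grad ** 2                # accumulated squared gradients: [0.25, 0.0]

print(grad / np.sqrt(history))          # [2., nan] -- the 0/0 entry becomes NaN
print(grad / np.sqrt(history + 1e-7))   # adding eps keeps every entry finite
```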
ok, will scatter_plus them
@cjolivier01 The purpose of the revised scatter_div operator (with specified indices) is that we can make it public to other people who want to work on other optimizers that perform sparse updates. The current scatter_div operator doesn't take indices as input and has some limitations.
BTW, when you edit my comment I don't see any notification, so I easily miss your updates...
python/mxnet/optimizer.py
Outdated
history[:] += square(grad)
div = grad / sqrt(history + self.float_stable_eps)

weight[:] += (div + weight * wd) * -lr
Instead of weight * wd, it should be sparse.retain(weight, grad.indices) * wd, since we're only updating the row slices that appear in grad.indices. Otherwise the update is not sparse - after one epoch each update touches a million rows if you use weight directly.
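In code, the suggested change (a fragment-style sketch using the variable names from the diff above; `sparse.retain` here is `mxnet.ndarray.sparse.retain`):

```python
from mxnet.ndarray import sparse

# Before (dense weight decay -- touches every row of the weight matrix):
#   weight[:] += (div + weight * wd) * -lr
# After (sparse weight decay -- only rows present in grad.indices are touched):
weight[:] += (div + sparse.retain(weight, grad.indices) * wd) * -lr
```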
Ah, that version made sense! thanks!
python/mxnet/optimizer.py
Outdated
Different/more selective scatter behavior can be done in a different PR. I don't have the bandwidth for that now.
python/mxnet/optimizer.py
Outdated
if is_sparse:
    history[:] = op.elemwise_add(history, op.square(grad))
    assert history.stype == save_history_stype
    srt = op.sqrt(_internal._scatter_plus_scalar(history, self.float_stable_eps))
Use scatter_plus(sparse.retain(history, indices)) instead of scatter_plus(history)? Otherwise the scatter_plus is expensive.
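That is, something like the following (a fragment-style sketch using the names from the diff above; `op` and `_internal._scatter_plus_scalar` are as in that diff, and `sparse.retain` is `mxnet.ndarray.sparse.retain`):

```python
from mxnet.ndarray import sparse

# Expensive: adds eps to every row stored in `history`, which keeps growing over epochs.
#   srt = op.sqrt(_internal._scatter_plus_scalar(history, self.float_stable_eps))

# Cheaper: retain only the rows being updated, then add eps and take the sqrt.
retained = sparse.retain(history, grad.indices)
srt = op.sqrt(_internal._scatter_plus_scalar(retained, self.float_stable_eps))
```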
ok...
@@ -697,6 +697,19 @@ def check_binary_op_with_scalar(stype,
                                force_overlap=force_overlap,
                                verbose=False)

# plus_scalar
Do we have tests for scatter_plus/scatter_div?
Yes
Also scatter_minus
mod.update()  # update parameters
# print('Epoch %d, Training %s' % (epoch, metric.get()))
assert(metric.get()[1] < 0.05), metric.get()[1]
def check_factorization_machine_module(optimizer=None, num_epochs=None):
Please add a unit test in test_optimizer.py to test sparse AdaGrad.
test_optimizer appears to test the C++ version against the python version. There is only a python version for AdaGrad, so it's not clear what it would test against. I am using the test_module() test with an expected accuracy rate to test.
To me, the purpose of the tests in test_optimizer is to verify that the update only involves rows that appear in grad.indices for rsp weight and rsp grad.
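A sketch of that kind of check (hypothetical test code, not what the PR ships; `check_sparse_rows_only` is a made-up name):

```python
import mxnet as mx
import numpy as np

def check_sparse_rows_only(opt, shape=(10, 4), rows=(1, 7)):
    # Row-sparse weight and a gradient that only has entries for `rows`.
    weight_np = np.random.uniform(size=shape)
    grad_np = np.zeros(shape)
    grad_np[list(rows)] = np.random.uniform(size=(len(rows), shape[1]))

    weight = mx.nd.array(weight_np).tostype('row_sparse')
    grad = mx.nd.array(grad_np).tostype('row_sparse')

    state = opt.create_state(0, weight)
    opt.update(0, weight, grad, state)
    after = weight.asnumpy()

    untouched = [r for r in range(shape[0]) if r not in rows]
    # Rows without gradient entries must be bit-for-bit unchanged...
    assert np.array_equal(weight_np[untouched], after[untouched])
    # ...while the rows listed in grad.indices should actually have moved.
    assert not np.array_equal(weight_np[list(rows)], after[list(rows)])

# Example usage with the python AdaGrad optimizer:
check_sparse_rows_only(mx.optimizer.AdaGrad(learning_rate=0.1))
```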
I have asserts in update that test for this as well as storage type, which run during the test_factorization_machine_module() test, which I modified to test multiple optimizers (sgd, adam, adagrad)
…o sparse_adagrad_pr
I don't think clip should be a sparse op.
/*!
 * \brief CSR operation requires temp space
 */
struct ResourceRequest {
??
Used to index into ctx.resources during the CSR pass. It is, in fact, referencing a ResourceRequest.
Clip handles that situation in FInferStorageType
Changed
Refactor AdaGrad optimizer to support sparse tensors + unary and binary refactoring for new infer storage type logic (apache#7903)
Refactor AdaGrad optimizer to support sparse tensors
Add sparse support for _plus_scalar, _minus_scalar, clip
Some additional unary and binary refactoring for new infer storage type logic