Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Zeroes CSR still needs a valid row_pointer array. #7935

Merged
merged 22 commits into from
Oct 10, 2017

Conversation

cjolivier01
Copy link
Member

Fix for: #7920

Looks like problem was introduced here
0b13631

In function
void FillZerosCsrImpl(mshadow::Stream *s, NDArray *dst)

Why?
CSR matrix, even if all "zeros", must have m + 1 items in its row pointer (csr::IndPtr) array. Current implementation leaves all aux arrays empty.

While I was here, added a couple more assertions and also reduced memory allocation for situations where lhs and rhs are the same array.

Added unit test.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of such _zeros implementation was to delay memory allocation until a sparse tensor has some non-zeros. If we want to enforce that a CSR ndarray always has valid indptr array, I believe this(_zeros) is not the only place that violates such constraint. Because a CSR NDArray is constructed with default shape for indptr = (0, )
https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/ndarray.h#L140`

Do you think it's possible for operators to work with indptr being empty?
@reminisce

// Fill a CSR NDArray with zeros by updating the aux shape.
/*! \brief Fill a CSR NDArray with zeros by updating the aux shape
*
* @tparam xpu - cpu or gpu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tparam -> param

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what? cryptic comment :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tparam = template parameter. It's actually named by the IDE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that. Learnt a new thing today 👍

@reminisce
Copy link
Contributor

I think it's okay for operators to work on a zero csr with empty indptr as long as the csr is checked if zero before performing any regular operations like the one exposed in #7920. In addition to what @eric-haibin-lin said about the intention of delaying memory allocation, I think it's also the intended way to avoid unnecessary memory allocation for zero csr tensors. We can provide a util function checking whether a sparse tensor (csr or rsp) is zero for being used in operator implementation.

@cjolivier01
Copy link
Member Author

I respectfully disagree. Maybe something like empty() should return uninitialized, but returning an Invald CSR array from zeros() and expecting the consuming function to initialize the array seems backwards and difficult to maintain. zeros() should return a CSR that appears as a matrix of zeroes. It is not doing that.

@cjolivier01
Copy link
Member Author

Also note that CheckAndAlloc() isn't going to fill that allocated array with zeros. And indptr array size isn't going to change so it doesn't need to be reallocated larr when the array is used.

@cjolivier01
Copy link
Member Author

While an unintialized RSP matrix is a valid all-zero matrix, this is not the case for CSR. Thus RSP returns a valid matrix and CSR does not.

@reminisce
Copy link
Contributor

I agree that an all zero initialized indptr in a zero csr sounds more correct from the definition point of view of csr and alleviates the burden on writing client code. If memory allocation is not a concern, I'm okay with the change.

@cjolivier01
Copy link
Member Author

cjolivier01 commented Sep 19, 2017 via email

@eric-haibin-lin
Copy link
Member

I agree that we should return a CSR array with valid indptr array for that zeros() function. Should we also update the default shape for indptr to shape(num_rows + 1) in ndarray.h?

@cjolivier01
Copy link
Member Author

cjolivier01 commented Sep 26, 2017

@eric-haibin-lin As far as I can tell, it already is (num_rows + 1):
dst->CheckAndAllocAuxData(csr::kIndPtr, mshadow::Shape1(dst->shape()[0] + 1));

@@ -241,64 +241,64 @@ def test_get_type():
kvtype = 'local_allreduce_cpu'
kv = mx.kv.create(kvtype)
assert kv.type == kvtype

def test_invalid_pull():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before, looks like Haibin fixed maybe

def test_elemwise_csr_same_zeros():
# Zeroes
a = mx.nd.sparse.zeros('csr', (1,1))
b = mx.nd.elemwise_add(a,a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binary ops for csr have been disabled in #7947 (currently inferring dense output).
https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/elemwise_binary_op.h#L573
Could you add it back with FInferStorage = ElemwiseStorageType<2, 1, true, true, true>) and all the necessary unit tests for CSR binary ops?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done a different way


// Input and output should have the same number of row pointer items (m + 1)
CHECK_EQ(output.aux_shape(csr::kIndPtr), lhs.aux_shape(csr::kIndPtr));
CHECK_EQ(output.aux_shape(csr::kIndPtr), rhs.aux_shape(csr::kIndPtr));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove SparseTempStorage class. @piiswrong suggest we register FResourceRequest for all sparse operators that requires temp space.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected you changed that in your PR... Your PR handled no CSR types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only refactored fallback logic in the previous PR.. And we don't have csr binary op tests back then..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incredibly messy, IMHO

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding identical FResourceRequest items for everything.

@@ -267,7 +267,7 @@ void SparseRetainOpForwardRspImpl(mshadow::Stream<xpu> *s,
if (!input_nd.storage_initialized()
|| idx_data.Size() == 0U
|| input_nd.shape()[0] == 0) {
FillZerosRspImpl(s, *output_nd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed on purpose. Why revert this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a conflict and causing a compile error because my PR changes this function. Why is it changed to const reference?

When resolving the conflict I used the surrounding functions as a reference:

inline void FillDnsZerosRspImpl(mshadow::Stream *s, NDArray *dst)
void PopulateFullIdxRspImpl(mshadow::Stream *s, NDArray *dst)

btw why is there so much const reference being passed around when the object is then modified? I see this everywhere. Is this because lint complains anout not using a pointer for modified items rather than a non-const reference?

Copy link
Contributor

@piiswrong piiswrong Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NDArray is a pointer type. So its ok to modify the content of the object it points to even if the pointer is const

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean NDArray::ptr_ is a const pointer type? NDArray was a const reference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, reverted

}

// Fill a CSR NDArray with zeros by updating the aux shape.
template<typename xpu>
void FillZerosCsrImpl(mshadow::Stream<xpu> *s, const NDArray& dst) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

@@ -402,7 +402,7 @@ inline void CopyFromToCsrImpl(const NDArray& from, const NDArray& to, RunContext
// if source storage is not initialized, fill destination with zeros
auto s = ctx.get_stream<to_xpu>();
if (!from.storage_initialized()) {
op::FillZerosCsrImpl<to_xpu>(s, to);
op::FillZerosCsrImpl(s, const_cast<NDArray *>(&to));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const cast should be avoided

@piiswrong piiswrong merged commit a48991f into apache:master Oct 10, 2017
@cjolivier01 cjolivier01 deleted the sparse_fix branch October 10, 2017 18:59
mbaijal pushed a commit to mbaijal/incubator-mxnet that referenced this pull request Oct 12, 2017
* Fix for: apache#7920

* lint

* remove unused variable warning

* Since GPU version of FillZerosCsrImpl() is called from a non-cuda-compiled file, the gpu version is compiled via cuda directly in init_op.cu

* Trigger build

* This test case is BS. I can't even tell what's wrong on the CI build because so many errors coming from this test.

* Update test_kvstore.py

* Update CMakeLists.txt

* merge fix

* Trigger build

* Fix 'dest' item in FIllXXX calls

* Revert NDARray as pointer in Fillxxx
crazy-cat pushed a commit to crazy-cat/incubator-mxnet that referenced this pull request Oct 26, 2017
* Fix for: apache#7920

* lint

* remove unused variable warning

* Since GPU version of FillZerosCsrImpl() is called from a non-cuda-compiled file, the gpu version is compiled via cuda directly in init_op.cu

* Trigger build

* This test case is BS. I can't even tell what's wrong on the CI build because so many errors coming from this test.

* Update test_kvstore.py

* Update CMakeLists.txt

* merge fix

* Trigger build

* Fix 'dest' item in FIllXXX calls

* Revert NDARray as pointer in Fillxxx
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants