This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[WIP] CSRNDArray Tutorial #7656

Closed
wants to merge 12 commits into from

eric-haibin-lin
Member

@eric-haibin-lin eric-haibin-lin commented Aug 29, 2017

Note:

This should not be merged before #7577

@eric-haibin-lin
Member Author

@szha

Member

@anirudh2290 anirudh2290 left a comment

Thank you for adding this tutorial!

(i.e. most of the elements are zeros).

Storing and manipulating such large sparse matrices in the default dense structure results
in wated memory and processing on the zeros.
Member

wated -> wasted

indices_list = [0, 2, 1]
a = mx.nd.sparse.csr_matrix(data_list, indptr_list, indices_list, shape)
# create a CSRNDArray with numpy arrays
data_np = np.array([7, 8, 9])
Member

We can just reuse the lists above: data_list, indptr_list, and indices_list.

- memory consumption is reduced significantly
- certain operations (e.g. matrix-vector multiplication) are much faster

Meanwhile, ``CSRNDArray`` inherits competitve features from ``NDArray`` such as
Member

Typo: competitve -> competitive

- certain operations (e.g. matrix-vector multiplication) are much faster

Meanwhile, ``CSRNDArray`` inherits competitve features from ``NDArray`` such as
lazy evaluation and automatic parallelization, which is not available in the
Member

is -> are

@eric-haibin-lin eric-haibin-lin changed the title CSRNDArray Tutorial [WIP] CSRNDArray Tutorial Aug 29, 2017
@eric-haibin-lin eric-haibin-lin changed the title [WIP] CSRNDArray Tutorial CSRNDArray Tutorial Aug 30, 2017
@@ -166,7 +175,7 @@ a.copyto(d)
{'b is a': b is a, 'b.asnumpy()':b.asnumpy(), 'c.asnumpy()':c.asnumpy(), 'd.asnumpy()':d.asnumpy()}
```

- If the storage types of source array and destination array doesn't match,
+ * If the storage types of source array and destination array doesn't match,
Member

Type...doesn't match
Or
Types...don't match


Many real world datasets deal with high dimensional sparse feature vectors. For instance,
in a recommendation system, the number of categories and users is in the order of millions,
while most users only made a few purchases, leading to feature vectors with high sparsity
Contributor

Make all sentences have a common tense -- which is the present tense here.
Suggestion: while most users typically make a few purchases only, which leads to ...

Storing and manipulating such large sparse matrices in the default dense structure results
in wasted memory and processing on the zeros.
To take advantage of the sparse structure of the matrix, the ``CSRNDArray`` in MXNet
stores the matrix in [compressed sparse row(CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29) format
Contributor

Suggestion: Add a space before the opening parenthesis throughout the document. Please check other occurrences in the doc and fix them as well.
FYI: https://english.stackexchange.com/questions/5987/is-there-any-rule-for-the-placement-of-space-after-and-before-parenthesis

the existing ``NDArray`` is that

- memory consumption is reduced significantly
- certain operations (e.g. matrix-vector multiplication) are much faster
Contributor

Add a period after faster.
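
For readers following this thread, the claim about matrix-vector multiplication can be sketched in plain Python. This is only an illustration under stated assumptions, not MXNet's implementation: `csr_matvec` is a hypothetical helper, and the three lists are the tutorial's 3x4 example matrix. The point is that the loop visits only the stored non-zero entries, so the work is proportional to the number of non-zeros rather than to rows times columns.

```python
def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR matrix by a dense vector (hypothetical helper)."""
    num_rows = len(indptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        # The stored entries of this row are data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


# The tutorial's example matrix:
# [[7, 0, 8, 0],
#  [0, 0, 0, 0],
#  [0, 9, 0, 0]]
data = [7, 8, 9]
indices = [0, 2, 1]
indptr = [0, 2, 2, 3]

x = [1, 2, 3, 4]
y = csr_matvec(data, indices, indptr, x)
print(y)  # [31.0, 0.0, 18.0]
```

Only three multiplications are performed here, whereas a dense 3x4 product would perform twelve.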

[0, 2, 1] # indices
[0, 2, 2, 3] # indptr
```

Contributor

Suggestion: I think the suggested text below may help newbies understand the various numbers better. Try it with a newbie if you like (correct the text spacing appropriately).

[7, 8, 9] # data: the non-zero elements of the dense matrix, flattened in row-major order.
[0, 2, 1] # indices: the column index of each element in data[].
[0, 2, 2, 3] # indptr: offsets into data[] marking where each row starts; row i occupies data[indptr[i]:indptr[i+1]].
# i.e. Row 0 starts at offset 0 and holds data[0:2], the elements 7 and 8.
# i.e. Row 1 starts at offset 2 and holds data[2:2], which is empty since Row 1 is all zeros.
# i.e. Row 2 starts at offset 2 and holds data[2:3], the element 9.
# i.e. the last element of indptr always equals the size of data[] (one past its last valid index), marking the end of data[].
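
As a way to check the reading of `data`, `indices`, and `indptr` described above, here is a minimal pure-Python sketch that rebuilds the dense matrix from the three arrays. `csr_to_dense` is a hypothetical helper written for this illustration; it is not part of MXNet, and the 3x4 shape is taken from the tutorial's example.

```python
def csr_to_dense(data, indices, indptr, shape):
    """Rebuild a dense row-major matrix from CSR arrays (hypothetical helper)."""
    rows, cols = shape
    dense = [[0] * cols for _ in range(rows)]
    for row in range(rows):
        start, end = indptr[row], indptr[row + 1]
        # An empty slice (start == end) means the row has no stored entries.
        for k in range(start, end):
            dense[row][indices[k]] = data[k]
    return dense


dense = csr_to_dense(
    data=[7, 8, 9],
    indices=[0, 2, 1],
    indptr=[0, 2, 2, 3],
    shape=(3, 4),
)
for row in dense:
    print(row)
# [7, 0, 8, 0]
# [0, 0, 0, 0]
# [0, 9, 0, 0]
```

Row 1 comes out all zeros because `indptr[1] == indptr[2]`, so its slice of `data` is empty.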

# create a CSRNDArray from a scipy csr object
d = mx.nd.sparse.array(c)
{'d':d}
except ImportError:
Contributor

Somehow in the rendered text, there is a newline between try and except and that causes invalid syntax when I cut-paste the text. Please check.

Member Author

Fixed now

```python
b = a * 2 # b will be a CSRNDArray since zero multiplied by 2 is still zero
c = a + 1 # c will be a dense NDArray
{'b.stype':b.stype, 'c.stype':c.stype}
Contributor

You say: b will be a CSRNDArray, but I see it as NDArray only. Am I interpreting things correctly?

b = a * 2 # b will be a CSRNDArray since zero multiplied by 2 is still zero
c = a + 1 # c will be a dense NDArray
{'b.stype':b.stype, 'c.stype':c.stype}
{'c.stype': 'default', 'b.stype': 'default'}
a.stype
'csr'
b.stype
'default' <======= NOT a CSRNDArray.
c.stype
'default'
b
[[ 14. 0. 16. 0.]
[ 0. 0. 0. 0.]
[ 0. 18. 0. 0.]]
<NDArray 3x4 @cpu(0)> <======= NOT a CSRNDArray.
a
[[ 7. 0. 8. 0.]
[ 0. 0. 0. 0.]
[ 0. 9. 0. 0.]]
<CSRNDArray 3x4 @cpu(0)>

Member Author

Currently it results in a dense NDArray because Chris's PR has not been merged yet, as mentioned in the summary.
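
The reasoning behind the tutorial's comments (`a * 2` can stay sparse, `a + 1` cannot) can be sketched in plain Python. This illustrates the idea only and is not MXNet's implementation; `csr_scale` and `csr_add_scalar` are hypothetical helpers. As noted in this thread, the build being tested still returns a dense NDArray in both cases.

```python
def csr_scale(data, indices, indptr, factor):
    """Scale a CSR matrix: zeros stay zero, so only `data` changes."""
    return [v * factor for v in data], list(indices), list(indptr)


def csr_add_scalar(data, indices, indptr, shape, scalar):
    """Add a scalar: every implicit zero becomes `scalar`, so the result is dense."""
    rows, cols = shape
    dense = [[scalar] * cols for _ in range(rows)]
    for row in range(rows):
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][indices[k]] += data[k]
    return dense


data, indices, indptr, shape = [7, 8, 9], [0, 2, 1], [0, 2, 2, 3], (3, 4)

scaled = csr_scale(data, indices, indptr, 2)
shifted = csr_add_scalar(data, indices, indptr, shape, 1)

print(scaled)   # ([14, 16, 18], [0, 2, 1], [0, 2, 2, 3])
print(shifted)  # [[8, 1, 9, 1], [1, 1, 1, 1], [1, 10, 1, 1]]
```

Scaling reuses `indices` and `indptr` unchanged, while adding a non-zero scalar fills all twelve positions, which is why a dense result is the natural output type there.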


* For operators that don't specialize in sparse arrays, we can still use them with sparse inputs with some performance penalty.
What happens is that MXNet will generate temporary dense inputs from sparse inputs so that the dense operators can be used.
Warning messages will be printed when such storage fallback event happens.
Contributor

Suggestion: when such a storage fallback event happens. (add the article: "a")

Contributor

Where are the warnings printed? I did not see them when I tried in a terminal window on macOS.

d = mx.nd.log(a) # warnings will be printed
a

[[ 7. 0. 8. 0.]
[ 0. 0. 0. 0.]
[ 0. 9. 0. 0.]]
<CSRNDArray 3x4 @cpu(0)>

d

[[ 1.9459101 -inf 2.07944155 -inf]
[ -inf -inf -inf -inf]
[ -inf 2.19722462 -inf -inf]]
<NDArray 3x4 @cpu(0)>

Member Author

As mentioned in the summary:
the behavior described in the Sparse Operators and Storage Type Inference section requires #7577 and the storage inference refactoring, so the warning message is not yet in the current master branch.

```

* For operators that don't specialize in sparse arrays, we can still use them with sparse inputs with some performance penalty.
What happens is that MXNet will generate temporary dense inputs from sparse inputs so that the dense operators can be used.
Contributor

Did you mean: temporary dense outputs?

Member Author

I meant temporary dense inputs, because dense operators don't handle sparse inputs. I should mention the storage type of the outputs, too. I'll update the section.
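
The storage fallback discussed here can be sketched conceptually in plain Python: the sparse input is first expanded into a temporary dense copy, and the ordinary dense operator then runs on that copy. This is a rough illustration under stated assumptions, not MXNet's code path. `to_dense`, `dense_log`, and `log_with_fallback` are hypothetical names, and mapping non-positive values to `-inf` is a simplification that is adequate only because the example matrix has no negative entries.

```python
import math


def to_dense(data, indices, indptr, shape):
    """Expand CSR arrays into a temporary dense matrix."""
    rows, cols = shape
    dense = [[0.0] * cols for _ in range(rows)]
    for row in range(rows):
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][indices[k]] = float(data[k])
    return dense


def dense_log(matrix):
    """A dense-only operator: elementwise log, with log(0) treated as -inf."""
    return [[math.log(v) if v > 0 else float("-inf") for v in row]
            for row in matrix]


def log_with_fallback(data, indices, indptr, shape):
    # Fallback: build a temporary dense input, then call the dense operator.
    temp_dense_input = to_dense(data, indices, indptr, shape)
    return dense_log(temp_dense_input)


result = log_with_fallback([7, 8, 9], [0, 2, 1], [0, 2, 2, 3], (3, 4))
print(result[0])
```

The output is dense as well, since `log` maps every implicit zero to `-inf`, which matches the `NDArray` result shown in the reviewer's session.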

### GPU Support

By default, CSRNDArray operators are executed on CPU. In MXNet, GPU support for CSRNDArray is experimental
with few sparse operators such as cast_storage and dot.
Contributor

Add: with only a few sparse operators such as...

Member Author

I used "few" instead of "a few" because we only have 2 operators supported on GPU. I can change it if "only a few" is more accurate.

gpu_device=mx.gpu() # Change this to mx.cpu() in absence of GPUs.

a = mx.nd.sparse.zeros('csr', (100, 100), ctx=gpu_device)
a
Contributor

If I run this code on macOS with no GPU, the python session seg-faults. I know that the context is set incorrectly to GPU when GPU is not present, but should the python session seg-fault? Shouldn't the python session give an error/exception that can be caught by the user and handled appropriately?

Member Author

Did it segfault and exit Python? For me there was an error message, "GPU support is disabled...":

>>> mx.nd.sparse.zeros('csr', (100, 100), ctx=mx.gpu())
[20:01:08] src/c_api/c_api_ndarray.cc:148: GPU support is disabled. Compile MXNet with USE_CUDA=1 to enable GPU support.
[20:01:08] /Users/haibilin/mxnet/dmlc-core/include/dmlc/logging.h:308: [20:01:08] src/c_api/c_api_ndarray.cc:546: Operator _zeros is not implemented for GPU.

Stack trace returned 5 entries:
[bt] (0) 0   libmxnet.so                         0x0000000107a73358 _ZN4dmlc15LogMessageFatalD2Ev + 40
[bt] (1) 1   libmxnet.so                         0x000000010822f447 _Z20ImperativeInvokeImplRKN5mxnet7ContextEON4nnvm9NodeAttrsEPNSt3__16vectorINS_7NDArrayENS6_9allocatorIS8_EEEESC_PNS7_IbNS9_IbEEEESF_ + 2039
[bt] (2) 2   libmxnet.so                         0x00000001082304f7 MXImperativeInvoke + 439
[bt] (3) 3   libmxnet.so                         0x0000000108230ace MXImperativeInvokeEx + 46
[bt] (4) 4   _ctypes.so                          0x0000000106fd67d7 ffi_call_unix64 + 79

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/mxnet-0.11.1-py2.7.egg/mxnet/ndarray/sparse.py", line 123, in __repr__
    shape_info, self.context)
  File "/usr/local/lib/python2.7/site-packages/mxnet-0.11.1-py2.7.egg/mxnet/ndarray/ndarray.py", line 1147, in context
    return Context(Context.devtype2str[dev_typeid.value], dev_id.value)
KeyError: 0
>>>

Member Author

This will be fixed in #7676.

@eric-haibin-lin eric-haibin-lin changed the title CSRNDArray Tutorial [WIP] CSRNDArray Tutorial Aug 31, 2017
@eric-haibin-lin
Member Author

Updated the tutorial with a runnable example for data iterators.

@eric-haibin-lin eric-haibin-lin changed the title [WIP] CSRNDArray Tutorial CSRNDArray Tutorial Sep 1, 2017
@eric-haibin-lin eric-haibin-lin changed the title CSRNDArray Tutorial [WIP] CSRNDArray Tutorial Sep 13, 2017
@eric-haibin-lin
Member Author

Moved to #7921

5 participants