This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add three sparse tutorials #7921

Merged

merged 11 commits into apache:master from eric-haibin-lin:three-tutorials on Oct 4, 2017
Conversation

Member

@eric-haibin-lin eric-haibin-lin commented Sep 16, 2017

Member

@rahul003 rahul003 left a comment

Tutorials look good :)

Added a few minor comments


``dense[rsp.indices[i], :, :, :, ...] = rsp.data[i, :, :, :, ...]``

A RowSparseNDArray is typically used to represent non-zero row slices of a large NDArray of shape [LARGE0, D1, .. , Dn] where LARGE0 >> D0 and most row slices are zeros.
Member

The naming of LARGE0 just felt slightly odd

Member Author

Hmm. Any suggestion for better names?
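For readers following along, here is a minimal sketch of the row-slice relationship described above, using a 2-D example. It assumes the tuple form of `mx.nd.sparse.row_sparse_array` and `tostype`, which may differ slightly across MXNet versions:

```python
import mxnet as mx

# Two non-zero row slices (rows 1 and 3) of a 4x2 array.
data = [[1., 2.], [3., 4.]]
indices = [1, 3]
rsp = mx.nd.sparse.row_sparse_array((data, indices), shape=(4, 2))

# dense[rsp.indices[i], :] = rsp.data[i, :]
dense = rsp.tostype('default')
print(dense.asnumpy())
# [[0. 0.]
#  [1. 2.]
#  [0. 0.]
#  [3. 4.]]
```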


## Inspecting Arrays

Also as with `CSRNDArray`, you can inspect the contents of a `RowSparseNDArray` by filling
Member

The word "Also" can be dropped.

```python
{'b is a': b is a, 'b.asnumpy()': b.asnumpy(), 'c.asnumpy()': c.asnumpy(), 'd.asnumpy()': d.asnumpy()}
```

If the storage types of source array and destination array do not match,
Member

The normal copyto function requires the shapes of the arrays to be the same, right?
How does this rowSparse.copyto work when the storage types are different, since that would require different amounts of memory? I think it would be good to clarify what happens here.

Member Author

The source array will be converted if its storage type does not match. I'll add some clarification here
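A minimal sketch of what the clarification could show, assuming `copyto` casts the source's storage type to the destination's, per the reply above:

```python
import mxnet as mx

# Source is csr, destination is dense; shapes must match, and the
# data is converted to the destination's storage type on copy.
src = mx.nd.ones((2, 2)).tostype('csr')
dst = mx.nd.zeros((2, 2))         # default (dense) storage
src.copyto(dst)
print(dst.stype)                  # 'default'
print(dst.asnumpy())              # [[1. 1.] [1. 1.]]
```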


* If sparse outputs are provided, MXNet will convert the dense outputs generated by the dense operator into the provided sparse format.

* For operators that don't specialize in sparse arrays, you can still use them with sparse inputs with some performance penalty.
Member

The first point is repeated

Member Author

Yeah I noticed that, too. Will remove it

Member Author

Actually, they're different...


* For operators that don't specialize in sparse arrays, you can still use them with sparse inputs with some performance penalty.
In MXNet, dense operators require all inputs and outputs to be in the dense format.
Member

Merging these two sentences into one bullet point is odd.

I'd actually suggest merging all the points into a paragraph, because there is a flow between them.
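To make the fallback concrete, a small sketch (the exact operators that fall back may vary by version; adding a non-zero scalar is one case where a sparse result would no longer be sparse):

```python
import mxnet as mx

a = mx.nd.ones((2, 2)).tostype('csr')
# Adding a non-zero scalar makes every element non-zero, so MXNet
# falls back to the dense operator and returns a dense result.
b = a + 1
print(a.stype)   # 'csr'
print(b.stype)   # 'default'
```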


You can instantiate an executor from a sparse symbol by using the `simple_bind` method,
which allocates zeros to all free variables according to their storage types.
The executor provides a `forward` method for evaluation and an attribute
Member

Is evaluation the right word to use here? As I understand it, calling this function executes the commands (to allocate zeros to those variables in this case), doesn't it?

Member Author

Which sentence are you referring to? `bind` will allocate the memory; `forward` will perform evaluation.
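A minimal sketch of the bind/forward split being discussed (variable names and the `stype` keyword are assumptions; the keyword differed in early releases):

```python
import mxnet as mx

a = mx.sym.var('a', stype='csr')
b = mx.sym.var('b')
c = mx.sym.elemwise_add(a, b)

# simple_bind allocates zero-filled arrays for the free variables
# according to their storage types; no computation happens yet.
exe = c.simple_bind(mx.cpu(), a=(2, 2), b=(2, 2))

# forward performs the actual evaluation.
exe.forward()
print(exe.outputs[0].asnumpy())   # all zeros, since the inputs are zeros
```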

```python
# Create module
mod = mx.mod.Module(symbol=lro, data_names=['data'], label_names=['label'])
# Allocate memory by given the input data and label shapes
```
Member

typo: given -> giving
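A self-contained sketch of the surrounding Module workflow (the symbol construction and shapes here are hypothetical, not taken from the tutorial):

```python
import mxnet as mx

# Build a small linear-regression symbol with a sparse data input.
data = mx.sym.var('data', stype='csr')
label = mx.sym.var('label')
fc = mx.sym.FullyConnected(data=data, num_hidden=1, name='fc')
lro = mx.sym.LinearRegressionOutput(data=fc, label=label, name='lro')

# Create module
mod = mx.mod.Module(symbol=lro, data_names=['data'], label_names=['label'])
# Allocate memory by giving the input data and label shapes
mod.bind(data_shapes=[('data', (8, 4))], label_shapes=[('label', (8,))])
mod.init_params()
```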

Contributor

@bhavinthaker bhavinthaker left a comment

Nicely written tutorials.

and each row is sparse (i.e. with only a few nonzeros).

## Advantages of Compressed Sparse Row NDArray (CSRNDArray)
For matrices of high sparsity (e.g. ~1% non-zeros), there are two primary advantages of `CSRNDArray` over the existing `NDArray`:
Contributor

You may want to introduce both terms, sparsity and density. Say: "For matrices of high sparsity, also known as low density (e.g. ~1% non-zeros = ~1% density), there are ..."

- certain operations are much faster (e.g. matrix-vector multiplication)

You may be familiar with the CSR storage format in [SciPy](https://www.scipy.org/) and will note the similarities in MXNet's implementation. However, there are some additional competitive features in `CSRNDArray` inherited from `NDArray`, such as lazy evaluation and automatic parallelization, that are not available in SciPy's flavor of CSR.

Contributor

It is possible that some newbie user may read the Sparse tutorial without having background on the terms. Hmm, do we want to define the terms here, even though one could argue that these are easy-to-understand terms?

Lazy evaluation means that any operations on the CSRNDArray are not performed immediately, but are delayed until their evaluation is specifically requested. Automatic parallelization means that all operations are automatically executed in parallel with each other without requiring explicit mention.

Member Author

Good point. These are mentioned in the NDArray tutorial which I listed as a prerequisite. I'll add the description nonetheless and add a link for further reading

Member Author

BTW I find "async non-blocking" more appropriate/accurate than "lazy evaluation", since the evaluation is actually not delayed. https://mxnet.incubator.apache.org/tutorials/basic/ndarray.html#lazy-evaluation-and-automatic-parallelization

Member Author

I'd suggest we avoid using the term lazy eval in the original NDArray tutorial, too. It's misleading
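A small sketch of the non-blocking behavior being discussed (assuming the standard `wait_to_read` API):

```python
import mxnet as mx

a = mx.nd.ones((1000, 1000))
# The call below queues the computation with the backend engine and
# returns immediately; it does not block the Python thread.
b = mx.nd.dot(a, a)
# Reading the result (or calling wait_to_read) blocks until it is ready.
b.wait_to_read()
print(b.shape)
```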

The `indptr` array is what will help identify the rows where the data appears. It stores the offset into `data` of the first non-zero element of each row of the matrix. This array always starts with 0 (reasons can be explored later), so indptr[0] is 0. Each subsequent value in the array is the aggregate number of non-zero elements up to that row. Looking at the first row of the matrix you can see two non-zero values, so indptr[1] is 2. The next row contains all zeros, so the aggregate is still 2, and indptr[2] is 2. Finally, the last row contains one non-zero element, bringing the aggregate to 3, so indptr[3] is 3. Your result:

indptr = [0, 2, 2, 3]

Contributor

Here is a pictorial summary of the above explanation:

```
[[7, 0, 8, 0]
 [0, 0, 0, 0]
 [0, 9, 0, 0]]

       C0 C1 C2 C3
        |  |  |  |
R0 -> [[7, 0, 8, 0]
R1 ->  [0, 0, 0, 0]
R2 ->  [0, 9, 0, 0]]

       R0    (R1,R2) Row-end => indptr
        |     |      |
       D0 D1 D2      D3
        |  |  |      |
data = [7, 8, 9]     Sentinel

indices = [C0, C2, C1]     = [0, 2, 1]
indptr  = [D0, D2, D2, D3] = [0, 2, 2, 3]
```

Contributor

We haven't talked about Sentinel. Can you describe that further?

Member Author

```python
data_list = [7, 8, 9]
indptr_list = [0, 2, 2, 3]
indices_list = [0, 2, 1]
a = mx.nd.sparse.csr_matrix(data_list, indptr_list, indices_list, shape)
```
Contributor

It may help to have this example too, saying that the user does not have to do the calculations by hand to create a CSRNDArray.

```python
darr = np.array([[7., 0., 8., 0.], [0., 0., 0., 0.], [0., 9., 0., 0.]])
dmx = mx.nd.array(darr)
csr = dmx.tostype('csr')
print({'csr': csr, 'csr.data': csr.data,
       'csr.indices': csr.indices, 'csr.indptr': csr.indptr})
```

Contributor

I like this approach. In fact, we should address this towards the beginning by showing how it describes the data and even extend it to show how these values can be used to regenerate the full dataset.
Also, I think it would help to build up the scenarios a bit more. Using only 0's and 1's to show that duplicated data is fine. (There was some content before that mentioned you couldn't duplicate data, but it was changed to refer to the indices.) Without an explicit example the reader may assume incorrectly that data should be unique. Then, as we move to this example, one or two duplicated numbers would be good as well as a larger dataset. We should have an example that demonstrates the compression factor since the small ones we currently have don't show any savings of space.
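As a cross-check of the hand-derived components, one could verify them against SciPy (SciPy here is just an independent reference, not part of the tutorial):

```python
import numpy as np
from scipy import sparse

# The same matrix as above; SciPy derives identical CSR components.
dense = np.array([[7, 0, 8, 0],
                  [0, 0, 0, 0],
                  [0, 9, 0, 0]])
m = sparse.csr_matrix(dense)
print(m.data)      # [7 8 9]
print(m.indices)   # [0 2 1]
print(m.indptr)    # [0 2 2 3]
```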

shape = (3, 4)
data_list = [7, 8, 9]
indptr_list = [0, 2, 2, 3]
indices_list = [0, 2, 1]
Contributor

Picky: I would suggest keeping the order here the same as the order when these terms were explained, that is, indices_list followed by indptr_list.


## Variables

Variables are placeholders for arrays. We can use them to hold sparse arrays, too.
Contributor

Remove the comma between arrays and too.



```python
var_exec.arg_dict['b'][:] = mx.nd.ones(shape).tostype('csr')
```
Contributor

FYI: NameError: name 'var_exec' is not defined
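A possible definition of `var_exec` so the snippet runs (a sketch; the tutorial presumably binds it from the variables introduced earlier):

```python
import mxnet as mx

shape = (2, 2)
b = mx.sym.var('b', stype='csr')
# Bind an executor so arg_dict['b'] exists and can be assigned to.
var_exec = b.simple_bind(mx.cpu(), b=shape)
var_exec.arg_dict['b'][:] = mx.nd.ones(shape).tostype('csr')
var_exec.forward()
print(var_exec.outputs[0].asnumpy())
```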

### Storage Type Inference

What will be the output storage types of sparse symbols? In MXNet, for any sparse symbol, the result storage types are inferred based on the storage types of the inputs.
You can read the [Sparse Symbol API](mxnet.io/api/python/symbol/sparse.html) documentation to find out what the output storage types are. In the example below we will try out the storage types introduced in the Row Sparse and Compressed Sparse Row tutorials: `default` (dense), `csr`, and `row_sparse`.
Contributor

Add http:// to the URL.
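A brief sketch of the inference rule (a hypothetical example; the exact output stypes are operator-specific, as the linked API docs describe):

```python
import mxnet as mx

a = mx.sym.var('a', stype='csr')
b = mx.sym.var('b', stype='csr')
c = mx.sym.elemwise_add(a, b)
exe = c.simple_bind(mx.cpu(), a=(2, 2), b=(2, 2))
# With two csr inputs, elemwise_add infers a csr output.
print(exe.outputs[0].stype)   # 'csr'
```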

and [mx.io.NDArrayIter](https://mxnet.incubator.apache.org/versions/master/api/python/io.html#mxnet.io.NDArrayIter)
support loading sparse data in CSR format. In this example, we'll use the `NDArrayIter`.

You may see some warnings from SciPy. You don't need to worry about those for this example.
Contributor

Hmm, I don't find use of SciPy below. What is the reason to see possible warnings from SciPy?

Member Author

mx.test_utils.rand_ndarray may call scipy.random if scipy is available. The random function in scipy converts coo to csr. In some versions of scipy, the random function uses a deprecated argument when performing the conversion. This is not something we can control.

Contributor

This is the output you get from running the block below.

```
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/coo.py:200: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  if np.rank(self.data) != 1 or np.rank(self.row) != 1 or np.rank(self.col) != 1:
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py:130: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  if np.rank(self.data) != 1 or np.rank(self.indices) != 1 or np.rank(self.indptr) != 1:
```
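A minimal sketch of iterating CSR data with `NDArrayIter` (hypothetical data; note that, to my knowledge, CSR inputs require `last_batch_handle='discard'`):

```python
import mxnet as mx

data = mx.nd.ones((8, 4)).tostype('csr')
label = mx.nd.zeros((8,))
train_iter = mx.io.NDArrayIter(data, label, batch_size=4,
                               last_batch_handle='discard')
for batch in train_iter:
    print(batch.data[0].stype)   # 'csr' for each batch
```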

### Training the model with multiple machines

To train a sparse model with multiple machines, please refer to the example in [mxnet/example/sparse/](https://github.com/apache/incubator-mxnet/tree/master/example/sparse)

Contributor

All 3 tutorials are well-written!


data = [7, 8, 9]

The `indices` array stores the column index for each non-zero element in `data`. As you cycle through the data array, starting with 7, you can see it is in column 0. Then looking at 8, you can see it is in column 2. Lastly, 9 is in column 1. Your result:
Member

Does it make sense to mention here that column indices for a specific row will be looked at in sorted order?

Member Author

Yes, that's mentioned a few lines below.

```python
{'b': b, 'c': c}
```

Note that accessing individual elements in a CSRNDArray is currently not supported.
Member

Should we also mention that slicing across axis 1 is not supported?


To create a RowSparseNDArray on gpu, we need to explicitly specify the context:

**Note** If a GPU is not available, an error will be reported in the following section. In order to execute it a cpu, set gpu_device to mx.cpu().
Member

In order to execute it a cpu -> In order to execute it on a cpu
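A small sketch of the CPU fallback the note describes (`gpu_device` follows the tutorial's naming; `mx.nd.sparse.zeros` is assumed from the sparse API):

```python
import mxnet as mx
from mxnet.base import MXNetError

gpu_device = mx.gpu()   # set this to mx.cpu() if no GPU is available
try:
    a = mx.nd.sparse.zeros('row_sparse', (2, 2), ctx=gpu_device)
    a.wait_to_read()    # force execution so a missing GPU surfaces here
    print(a.context)
except MXNetError:
    print("No GPU found; set gpu_device = mx.cpu() instead.")
```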

@piiswrong piiswrong merged commit 7b7f9ae into apache:master Oct 4, 2017
mbaijal pushed a commit to mbaijal/incubator-mxnet that referenced this pull request Oct 12, 2017
* add three sparse tutorials

* update csr and rsp

* update train.md

* fix wrong var name

* first round of edits

* first edits and removed outputs

* first round of edits

* first round of edits

* updates based on CR comments

* update tutorial link. It wont work until the PR is merged

* update based on cr comments
@eric-haibin-lin eric-haibin-lin deleted the three-tutorials branch October 18, 2017 17:30
crazy-cat pushed a commit to crazy-cat/incubator-mxnet that referenced this pull request Oct 26, 2017