
[RFC] Introducing NumPy-compatible coding experience into MXNet #14253

Closed
reminisce opened this issue Feb 25, 2019 · 20 comments
Labels
RFC Post requesting for comments

Comments

@reminisce
Contributor

Motivation

Today, deep learning scientists spend the majority of their time on data processing, debugging tensor algorithms, and tuning model parameters, rather than architecting models from scratch, thanks to the abundance of pre-trained models available in deep learning model zoos. This has made the usability of tensor APIs a key factor in a framework's wide adoption.

MXNet was initially designed with a focus on memory efficiency, computation throughput, and scalability. Usability problems have begun to surface as more and more models exhibit dynamic behavior, e.g. tensor shapes unknown before runtime, control flow depending on runtime results, etc. Here we highlight users' most frequent complaints about usability.

  • Scalar tensors (aka zero-dim tensors) are not supported. For example, given a = [0, 1, 2], a[1] will generate an NDArray of shape (1,), instead of () as in NumPy.
  • Zero-size tensor is not supported. For example, a tensor of shape (0, 16, 256) cannot be passed to an operator, because our system currently treats 0, the first dimension size, as unknown, rather than a concrete number.
  • Many operators' signatures and functionality are not NumPy compatible, e.g. nd.dot vs. np.dot, nd.concatenate vs. np.concatenate, etc.
  • Many NumPy operators are missing. See the reference link to GitHub issues.
  • Operators whose outputs' shapes can only be determined at runtime are not supported, e.g. data[data < 0] cannot run.
  • Diverged programming experience due to the separation of imperative and symbolic operators registered under mxnet.ndarray and mxnet.symbol.
  • Control flow operators are hard to use. Users have to understand the complicated signatures of control flow operators, instead of writing native Python code using for, while, if/else, etc.
    For example, we have learned (the hard way) that it does not make much sense to ask users to write code like the following to perform a cumulative sum.
def sum(state, i):
    # Loop body: add data[i] to the running sum, advance the counter.
    s = state + data[i]
    return s, [s, i + 1]

def sum_cond(state, i):
    # Loop condition: keep iterating while i < 4.
    return i < 4

out, state = F.contrib.while_loop(sum_cond, sum,
                                  [F.zeros((1,)), F.zeros((1,))],
                                  max_iterations=5)

Instead, users should be able to simply write native Python code like the following and, if required, let the framework serialize it into a computation graph for optimization and deployment.

data = np.arange(5)
out = 0
i = 0
while i < 5:
    out = out + data[i]
    i = i + 1

It is not hard to see that all of the above pain points stem from the lack of a NumPy-compatible coding experience in MXNet. Better support for control flow operators, and a consolidated, more flexible coding style for imperative and symbolic code, require fundamental changes to the codebase to build new infrastructure, such as a new graph IR and executor; that is extremely non-trivial and should be executed with a long-term plan. In the meantime, we can improve usability by fixing the issue of zero-dim/zero-size tensors and implementing NumPy operators in MXNet. We discuss how to achieve these short-term goals in the following.

Support of zero-dim and zero-size tensors

What's the problem?

Zero-dim and zero-size tensors are valid tensors in NumPy. The former, whose shape is (), represent scalars in numpy.ndarray format. The latter, which have one or more dimensions of size zero in their shapes, are useful as placeholders in many ndarray operations, such as concatenating a zero-size ndarray with another ndarray. MXNet supports neither, because it reserves the empty shape () and zero dimension sizes to indicate unknown shape information, which must be filled in during the shape inference stage before tensor computation can proceed.
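
For reference, the target behavior can be demonstrated with stock NumPy:

import numpy as np

a = np.array(3.14)                   # zero-dim (scalar) tensor
print(a.shape, a.ndim)               # () 0

b = np.zeros((0, 16, 256))           # zero-size tensor: first dim is 0
c = np.ones((4, 16, 256))
print(np.concatenate([b, c]).shape)  # (4, 16, 256); b acts as a placeholder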

How to resolve the problem?

We can first change the current semantics to comply with the NumPy definition.

  1. Change the definition of unknown shapes from ndim = 0 to ndim = -1 in TShape class.
  2. Change the definition of unknown dimension sizes from dim_size = 0 to dim_size = -1 in TShape class.

After this, we need to scan the entire codebase and modify the places where shape.ndim() == 0 and shape.Size() == 0 are used to perform unknown-shape checks.
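
For illustration only (the actual change lives in the C++ TShape class), the semantic shift amounts to rewriting unknown-shape checks along the following lines; the function names are pseudocode, not real MXNet internals:

# Old semantics: ndim == 0 and dim_size == 0 both meant "unknown".
def shape_is_known_old(ndim, dims):
    return ndim != 0 and all(d != 0 for d in dims)

# New semantics: -1 means "unknown", so ndim == 0 and dim_size == 0
# are freed up to denote valid zero-dim and zero-size tensors.
def shape_is_known_new(ndim, dims):
    return ndim != -1 and all(d != -1 for d in dims)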

Please note that although MXNet's shape is a type inheriting from nnvm::Tuple, which is often used to represent a list-like object such as axis=(1, 2, 3), we will not change the meaning of an empty tuple. This separation of definitions for empty shape and empty tuple keeps their roles clearly decoupled.

We propose to break down the effort into the following steps.

  1. Copy tuple.h from NNVM to MXNet and rename nnvm::TShape to mxnet::TShape.
  2. Replace all the places in MXNet where nnvm::Tuple and nnvm::TShape are used with mxnet::Tuple and mxnet::TShape, respectively.
  3. Change the definition of TShape in tuple.h to use ndim = -1 to indicate unknown shapes and dim_size = -1 to indicate unknown shape dim sizes.
  4. Modify all the existing shape inference and utility functions where ndim == 0 and dim_size == 0 are used to accommodate the above changes.
  5. Modify NNVM passes, InferShape, PlanMemory, and Gradient, where nnvm::TShape is used, to accommodate the above changes.
  6. Add sufficient unit tests.

How is backward compatibility guaranteed?

By default, we do not change the original definition of output shapes in shape inference functions; we only change ndim==0 to ndim==-1 for unknown-shape verification. We expect no backward compatibility issues except in one case: NDArray indexing. To elaborate, the current behavior dictates that x[i] always returns a tensor with ndim >= 1. We can keep this behavior unchanged by default and provide a global switch that users can turn on to get NumPy-compatible results.
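
A sketch of how such a switch might behave; the name set_np_shape below is purely illustrative, since this RFC does not fix the API:

import mxnet as mx

a = mx.nd.array([0, 1, 2])
print(a[1].shape)      # (1,) -- legacy behavior, kept by default

mx.set_np_shape(True)  # hypothetical global switch, name illustrative
print(a[1].shape)      # expected: () once NumPy compatibility is on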

Previous discussion of this topic can be seen here.

Implementation of NumPy operators

What to do?

To address the problem of operator incompatibility with NumPy and to alleviate the pain of the diverged programming experience caused by the operator namespace separation (mxnet.ndarray vs. mxnet.symbol), we propose creating a new namespace mxnet.numpy, adopting operator APIs from NumPy, and implementing those APIs under the new namespace. mxnet.numpy should provide the same imperative programming experience as NumPy and will gradually replace all the non-neural-network operators in the current codebase. While implementing NumPy operators in MXNet, we may also leverage TVM to generate high-performance kernels (ref.).
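
Under this proposal, imperative user code would read like NumPy code with the import swapped. A sketch of the intended experience (mxnet.numpy is the namespace proposed here, not an existing module):

from mxnet import numpy as np  # proposed namespace

a = np.arange(12).reshape(3, 4)
b = np.ones((4, 2))
c = np.dot(a, b)               # NumPy-compatible signature and semantics
print(c.shape)                 # (3, 2)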

Can mxnet.numpy operators be used in Gluon for hybridization?

The newly implemented NumPy operators can still be accessed through the module (ndarray/symbol) delegate F in Gluon, e.g. F.numpy.dot. This works because the new operators are still registered under mxnet.ndarray and mxnet.symbol behind the scenes. Users are simply encouraged to access the NumPy operator APIs through mxnet.numpy when writing pure imperative code, and through the Gluon APIs for a hybrid coding experience, as sketched below.
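
A minimal HybridBlock sketch, following the F.numpy.dot naming used in this proposal; because the operators are registered under both mxnet.ndarray and mxnet.symbol, the same code runs imperatively and after hybridize():

from mxnet.gluon import HybridBlock

class Projection(HybridBlock):
    def __init__(self, **kwargs):
        super(Projection, self).__init__(**kwargs)
        with self.name_scope():
            self.weight = self.params.get('weight', shape=(4, 2))

    def hybrid_forward(self, F, x, weight):
        # F resolves to mxnet.ndarray imperatively and to
        # mxnet.symbol once the block is hybridized.
        return F.numpy.dot(x, weight)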

Where to contribute code?

A dev branch has been opened for this proposal.
https://github.com/apache/incubator-mxnet/tree/numpy

@junrushao1994 @szha @eric-haibin-lin @zheng-da @yzhliu

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

@reminisce reminisce added the RFC Post requesting for comments label Feb 25, 2019
@junrushao
Member

+1 for this RFC.

NumPy compatibility has been a long-standing desire of both developers and users. It would be very meaningful if we could make it possible.

@apeforest
Contributor

apeforest commented Feb 25, 2019

+1 for this RFC.

The inconsistent APIs, even among MXNet's own operators, have caused much confusion for users. It will be a great improvement in usability if we can make MXNet APIs compatible with NumPy.

I would suggest that we establish a formal review process for PRs that include API changes or additions, to prevent the creation of inconsistent APIs in the future.

@ifeherva
Contributor

ifeherva commented Feb 25, 2019

+1 for this RFC.

I especially like the numpy namespace proposal; that will help clean up a lot of things.

My experience is that the major blocker for numpy compatibility (and a source of bad user experience) is the lack of dynamic shape inference. I cannot wait to have that out.

Anyway, since I have already written a handful of operators, I am very happy to lend a hand in making MXNet fully numpy-compatible once dynamic shape inference is done.

@nickguletskii
Contributor

+1 for handling zero-size arrays.

I'm not that concerned about numpy compatibility, but the lack of zero-size arrays is something I would like to see fixed, since the current situation means that empty arrays have to be carefully padded to avoid causing problems.

@lanking520
Member

+1 for this RFC.

A consistent experience would also help the JVM language bindings stay in sync with Python. It reduces the barrier for users familiar with Python to write the same thing in Scala.

@wkcn
Member

wkcn commented Feb 27, 2019

+1 for this RFC.

It will make MXNet more flexible to use, especially for slicing, and I hope mx.numpy can eliminate the divergence between mx.nd and mx.sym. : )

I wonder how mx.numpy will be implemented: by using the Python ast module to extract the abstract syntax tree and then running it with a JIT, or by implementing it entirely in Python? We should also focus on the deployment of mx.numpy.

I do not think F.numpy.dot is a good idea, since it is confusing to have mx.numpy, mx.nd.numpy, and mx.sym.numpy all exist. We only need mx.numpy to support both mx.numpy.dot(a_nd, b_nd) and mx.numpy.dot(a_sym, b_sym).

@reminisce
Contributor Author

@wkcn All of what you have said makes sense. :)

The Gluon APIs, GluonNLP, and GluonCV depend heavily on the current MXNet infrastructure, so we have to execute this in an organized and steady manner in order not to break backward compatibility. The current NNVM has its own limitations in expressing dynamic shapes and control flow operators; we will eventually need a new IR (Relay is an option) to do AST transformation.

@anirudh2290
Member

anirudh2290 commented Feb 28, 2019

Thanks for the RFC!

It is just that users are encouraged to access NumPy operator APIs through mxnet.numpy to write pure imperative code and Gluon APIs for achieving hybrid coding experience.

Earlier, mxnet.ndarray was supposed to give you the experience of writing pure imperative code. Why can't we add the operators under this namespace and make the interface changes to the existing operators? Is there a list of operators whose APIs have diverged between numpy and ndarray, and can this be timed with the 2.0 release?

We can keep the current behavior unchanged and implement a global switch for users to turn on for expecting NumPy-compatible results.

If I understand correctly, even when using the numpy namespace you need to toggle this switch (probably an env variable?) to obtain the correct slicing? Have you also considered implementing a separate numpy ndarray class, with specific functions for slicing such as __getitem__, to avoid this switch?

@szha
Member

szha commented Feb 28, 2019

@anirudh2290

Why can't we add the operators under this namespace and make the interface changes for existing operators ?

We can. However, some operators in mxnet.ndarray share names with their numpy counterparts but behave slightly differently, which means the two sets cannot coexist in the same namespace if we want to preserve backward compatibility. On the other hand, 2.0 is a good opportunity to fix many of the existing problems beyond operator behaviors, so we'd likely want to take that time. Thus, to start now, a new namespace is the most straightforward way to go.

Have you also considered implementing a separate numpy ndarray

Yes. Creating a different array type means we'd start to see diverging user code, with some using ndarray and some using the numpy ndarray, which would become harder to migrate later.

@TaoLv
Member

TaoLv commented Feb 28, 2019

@reminisce @szha NumPy has references/views and strides in its ndarray structure, while MXNet's NDArray doesn't. How does this impact the design of the NumPy-compatible coding experience?

@junrushao
Member

@TaoLv In neural nets, once you do backprop, you cannot overwrite data because it destroys checkpointing.

@TaoLv
Member

TaoLv commented Feb 28, 2019

I'm not sure I understand the checkpointing. Can you explain a bit more? I think we have a memory planning pass that decides whether data can be overwritten? Also, there are NumPy-based frameworks like Theano and Chainer.

@reminisce
Contributor Author

@TaoLv MXNet could adopt the same view concept as NumPy by implementing strides, but I don't think that is our first priority, because views are rarely useful in training (though perhaps useful in data preprocessing). @junrushao1994's point is that in-place assignment is invalid during backpropagation, as it would wipe out pre-stored autograd information. This is consistent with other DL frameworks.
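
To make the autograd point concrete, standard MXNet 1.x autograd records inputs during the forward pass; if x were overwritten in place after recording, the saved input for the backward pass would be corrupted:

import mxnet as mx
from mxnet import autograd

x = mx.nd.array([1.0, 2.0, 3.0])
x.attach_grad()
with autograd.record():
    y = x * x      # autograd must keep x to compute dy/dx = 2x
    z = y.sum()
z.backward()
print(x.grad)      # [2. 4. 6.]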

@apeforest
Contributor

apeforest commented Feb 28, 2019

We can. However, some operators in mxnet.ndarray share names with their numpy counterparts but behave slightly differently, which means the two sets cannot coexist in the same namespace if we want to preserve backward compatibility.

Do we really have to carry this burden of backward compatibility all the way beyond 2.0? I feel the existing operators are confusing enough that 2.0 may be a good time for us to make the APIs clean and easy to use. Would adding a new namespace mx.numpy alongside the existing mx.sym and mx.ndarray cause more confusion for new users?

@reminisce
Contributor Author

@apeforest Because MXNet guarantees backward compatibility, those two namespaces have to be kept until 2.0. Adding the numpy namespace lowers the bar for data scientists from the NumPy community to use the DL framework. As for the framework itself, the purpose is to deemphasize the difference between mxnet.symbol and mxnet.ndarray in this major release. To eventually retire those two 1.x namespaces, one practical thing we can do in the future is to register all ops under namespaces like numpy, nn, etc. with unified interfaces supporting both NDArray and Symbol arguments; then, in Gluon, we can remove the second-level module delegate F, as sketched below.
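
A rough sketch of what such a unified interface might look like at the Python level (purely illustrative; the real mechanism would live in the operator registration layer):

from mxnet import ndarray as nd
from mxnet import symbol as sym

def dot(a, b):
    # Dispatch on input type so one API serves both imperative
    # (NDArray) and symbolic (Symbol) programming styles.
    if isinstance(a, sym.Symbol):
        return sym.dot(a, b)
    return nd.dot(a, b)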

@apeforest
Contributor

apeforest commented Feb 28, 2019

@reminisce I am fine with keeping those two namespaces until 2.0 for backward compatibility. Starting from 2.0, I feel we may want to just drop mx.ndarray and mx.symbol and make mx.numpy the only namespace exposed to users. I like the unified interface idea you proposed.

@mouryarishik

+1 for this RFC.

@larroy
Contributor

larroy commented May 22, 2019

What's the plan regarding: "Instead, users should be able to simply write native Python code like the following and, if required, let the framework serialize it into a computation graph for optimization and deployment"? I would get the Python AST and convert it to a computational graph; it seems that part is not described in detail, so I guess it is a long-term phase.

@szha
Member

szha commented Jul 31, 2020

This feature was made available as an experimental feature in 1.6 and will be fully supported in 2.0. Thanks to everyone who contributed to this major feature!
