Update Comparison with Other Frameworks #2717

Merged
merged 11 commits into from Jul 27, 2017

Conversation

@jekbradbury
Contributor

jekbradbury commented May 8, 2017

As promised in #2685 I have updated the framework comparison table with (almost?) every actively developed deep learning framework and several new axes of comparison. Let me know if anything seems inaccurate or irrelevant.

@delta2323

Member

delta2323 commented May 9, 2017

The errors that occurred in the Travis CI are now fixed in the master branch (#2716). So, the problem will be solved if you merge the latest master branch.

@delta2323

Member

delta2323 commented May 26, 2017

I am sorry to make you wait. We are now working on tasks related to v2, which will be released on May 30th, and have not had time to review this PR. Is it OK to review it after the v2 release?

@delta2323 delta2323 self-assigned this May 26, 2017

@jekbradbury

Contributor

jekbradbury commented May 26, 2017

No problem. Looking forward to v2 release!

@niboshi niboshi added the document label Jun 2, 2017

@delta2323

Member

delta2323 commented Jun 9, 2017

Sorry for making you wait. Now I can work on this PR.

@delta2323 delta2323 requested review from niboshi, bkvogel and delta2323 Jun 16, 2017

@bkvogel

Contributor

bkvogel commented Jun 19, 2017

I was wondering if we might want to include another row in the table for acceleration using anything other than cuDNN (such as OpenCL and/or other hardware platform support), but I am not aware of any frameworks with actively-developed support.

@delta2323

Member

delta2323 commented Jun 19, 2017

Review status

framework       assignee     status
Chainer         @delta2323   done
PyTorch         @delta2323   done
TensorFlow      @mitmul      done
Theano-based    @niboshi     done
Caffe1/2        @niboshi     done
Torch7          @bkvogel     done
MXNet           @bkvogel     done
DyNet           @mitmul      done
PaddlePaddle    @delta2323   done
DL4J            @delta2323   done
CNTK            @mitmul      done
neon            @delta2323   done
Knet.jl         @delta2323   done
Darknet         @niboshi     done
Thinc           @bkvogel     done
@delta2323

Member

delta2323 commented Jun 19, 2017

@jekbradbury Could you let us know whether you had criteria in mind for which frameworks to include in the table? (I should have discussed this before assigning reviewers and starting to review each column.)

@jekbradbury

Contributor

jekbradbury commented Jun 19, 2017

I tried to include every actively developed full-featured deep learning framework that isn't just a wrapper or frontend around another framework. All of them are open source and either used fairly widely by researchers or supported by a major company, with the exception of Darknet, which is prominent in the area of visual object detection and useful if you want to use pure C.

BTW I'd describe non-CUDA GPU support as:
Chainer, Theano, DyNet, PaddlePaddle, DL4J, CNTK, Knet, Darknet, Thinc: CUDA-only for the foreseeable future
TensorFlow: Codeplay has a port to ComputeCPP, their proprietary SYCL-based OpenCL frontend; some subset of that has been merged into core
Torch7, TensorFlow, eventually PyTorch: Hugh Perkins has Coriander-based ports to OpenCL 1.2 but Coriander has performance drawbacks
PyTorch, Caffe1/2, Torch7, MXNet, eventually TensorFlow: AMD has in-progress HIP ports (this is not OpenCL, it's a CUDA-like framework that can cross-compile to both NVidia CUDA and AMD's ROCm API); these use MIOpen, AMD's upcoming cuDNN clone, and are likely to be performance-competitive but rely on AMD developers doing significant porting work and then either maintaining forks or getting their PRs accepted.
Also, of course, TensorFlow supports Google's in-house TPU hardware through a closed-source XLA compiler backend.

@delta2323

Member

delta2323 commented Jun 21, 2017

Thank you for your comment on the choice of DL frameworks. It seems reasonable to me.

As for non-CUDA GPU support, I am not sure for now to what extent we should mention unofficial forks, since we can expect many forks that widen non-CUDA GPU support and it could be difficult to decide where to draw the line. Do you have any ideas?

@delta2323

Member

delta2323 commented Jun 21, 2017

Could you let us know what's the difference between "full" and "partial" in CNNs/RNNs rows?

@jekbradbury

Contributor

jekbradbury commented Jun 21, 2017

It's not particularly precise, but essentially I wanted to distinguish between frameworks that aim to support all major variants/uses of CNNs/RNNs (it may not be easy to write them in the framework, but it is at least possible) and frameworks that have more limited support. For example, Caffe was not designed with NLP in mind and their RNN support is not flexible or customizable; DyNet is an NLP-focused framework that recently added basic convolution and pooling layers but wouldn't be a good choice if you want to write a complex computer vision model. A few frameworks don't intend to support certain use cases at all (e.g. Thinc is only for NLP and Darknet is only for computer vision).

@jekbradbury

Contributor

jekbradbury commented Jun 21, 2017

Also, I don't think we need to mention non-CUDA at least until AMD officially announces their ports (which will be the first performance-competitive, well-supported deep learning frameworks for non-NVidia GPUs) -- right now they're still in progress on GitHub.

@bkvogel

Contributor

bkvogel commented Jun 23, 2017

I checked the table for MXNet, Torch7, and Thinc and think it looks fine. I did not see any inaccurate information.
I also agree that there is no need to mention non-CUDA for now, since the AMD ports are apparently still in progress.

@mitmul mitmul self-assigned this Jun 26, 2017

@niboshi niboshi self-assigned this Jun 26, 2017

@mitmul

Member

mitmul commented Jun 26, 2017

Thank you for the PR! Could you tell me what "Per-batch architectures" means?

@jekbradbury

Contributor

jekbradbury commented Jun 26, 2017

That row was in the original version of the comparison table. It means that the framework is capable of building a totally different network structure for each batch; that's essentially the same thing as define-by-run but it emphasizes what you can do with it.
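
As a rough illustration of "a different network structure for each batch", here is a minimal pure-Python sketch. It is not Chainer code: `forward`, `weight`, and `trace` are hypothetical names, and the "network" is a toy linear recurrence. The point is only that the number of executed steps is decided by the data at run time, so the recorded graph differs from batch to batch.

```python
def forward(batch, weight, trace):
    """Run a toy recurrent net whose depth equals the batch's sequence length."""
    hidden = [0.0] * len(batch)      # one hidden value per example
    steps = len(batch[0])            # graph depth is decided by the data
    for t in range(steps):
        hidden = [weight * h + seq[t] for h, seq in zip(hidden, batch)]
        trace.append(f"step{t}")     # record each op as it executes
    return hidden


trace_a, trace_b = [], []
out_a = forward([[1.0, 2.0], [3.0, 4.0]], 0.5, trace_a)   # two time steps
out_b = forward([[1.0, 2.0, 3.0, 4.0]], 0.5, trace_b)     # four time steps
```

After these two calls, `trace_a` holds two entries and `trace_b` holds four: the same Python function produced two differently shaped computations.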

@mitmul

Member

mitmul commented Jun 26, 2017

Oh, sorry, I didn't notice that. OK, now I understand it! Thank you for the kind explanation :)

@mitmul

Member

mitmul commented Jun 26, 2017

So, how about changing the row title to "Different architectures per-batch"?

@delta2323

Member

delta2323 commented Jun 27, 2017

What are the differences among "Multi-GPU ~ parallelism", "Multiprocessing", and "Distributed training"?

The same question came up when I was checking PyTorch. Why do you think PyTorch's multiprocessing support is partial?

@jekbradbury

Contributor

jekbradbury commented Jun 28, 2017

Thanks for the detailed feedback, and for catching a bunch of mistakes! I'll fix the cells I was wrong about soon. Here are some clarifications:

What are the differences among "Multi-GPU ~ parallelism", "Multiprocessing", and "Distributed training"?
Multi-GPU model parallelism: on one machine, placing different parts of a model on different GPUs
Multi-GPU data parallelism: on one machine, replicating the model across GPUs with synchronous data-parallel training (e.g. ParallelUpdater)
Distributed training: training a single model across multiple machines (e.g. ChainerMN)
Multiprocessing: training across multiple OS processes on the same machine (e.g., MultiprocessParallelUpdater) -- this is important for frameworks that run lots of Python code at runtime because Python can only use multiple CPU cores with multiprocessing
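
To make the synchronous data-parallel case concrete, here is a minimal sketch in plain Python. It does not use any framework API: `grad_mse` and `data_parallel_grad` are hypothetical helpers, the "devices" are just loop iterations, and the batch is assumed to divide evenly across them. It shows that averaging per-shard gradients reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, n_devices):
    """Split the batch into equal shards, take one gradient per shard, average."""
    shard = len(xs) // n_devices     # assumes len(xs) is divisible by n_devices
    grads = []
    for d in range(n_devices):
        lo, hi = d * shard, (d + 1) * shard
        grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))   # would run on device d
    return sum(grads) / n_devices    # stands in for the all-reduce step


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
sharded = data_parallel_grad(0.5, xs, ys, n_devices=2)
```

Model parallelism would instead keep the whole batch together and place different layers on different devices, so it changes where each operation runs rather than how the batch is split.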

In "CPU/GPU backend", "custom" could be misunderstood to mean that users can use their own custom backend. How about writing "native" instead?
"Native" works, but the point I was trying to get across is that those frameworks built their own array types from scratch but don't expose them with separate APIs, meaning that the array backend is less modular/extensible and can't be used on its own.
It's better to put a link to the source if any.
Yes, I'll do that.

Theano-based:
In "Higher-order grads", it seems to only support Hessian (http://deeplearning.net/software/theano/tutorial/gradients.html). How about writing as "Only Hessian"?

No, Theano supports arbitrarily nested theano.gradient.grad calls (I've used 4+, although it gets very slow). Higher-order grads are most useful for calculating the Hessian, though, so Theano also offers a convenience function for that use case.
"Multi-GPU data parallelism" seems to be experimental (https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs). How about writing "Experimental"?
Theano has a very slow development process now, so even things that are several years old and work pretty well are described as "new" and "experimental." Not sure what the best way to describe it in the table here is -- I definitely wouldn't want to use Theano for multi-GPU projects, but that's because I think it would be unnecessarily complicated, not because it would be broken.
It seems to have a native trainer. (https://github.com/kirk86/theano/blob/master/trainer.py)
That's a fairly old module in someone's stale branch; Theano currently (by design) leaves things like trainers/iterators/datasets to wrapper packages including Blocks, Lasagne, and Keras.
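
The point above about nesting grad calls can be sketched without Theano. The following is a swapped-in technique, not Theano's method: Theano differentiates symbolic graphs in reverse mode, while this toy uses forward-mode dual numbers. `Dual`, `grad`, and `nth_grad` are hypothetical names, only `+` and `*` are supported, and the sketch does not guard against perturbation confusion, so it is meant for single-variable nesting only.

```python
class Dual:
    """A value paired with its derivative; either part may itself be a Dual."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(x, 0.0)


def grad(f):
    """Return a function computing df/dx; the result can be passed to grad again."""
    def df(x):
        return _lift(f(Dual(x, 1.0))).deriv
    return df


def nth_grad(f, n):
    """Nest grad n times."""
    for _ in range(n):
        f = grad(f)
    return f


def cube(x):
    return x * x * x


derivs = [nth_grad(cube, n)(2.0) for n in range(1, 5)]
```

Each extra level of nesting wraps the input in one more `Dual`, which hints at why deeply nested gradients get slow: the amount of arithmetic grows with every level.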

Caffe1/2:
It also has a MATLAB binding (only Caffe1, though). (http://caffe.berkeleyvision.org/tutorial/interfaces.html)

That's true, thanks.
I couldn't find a source about Multi-GPU model parallelism.
It looks like Caffe1 never did implement model parallelism, except in some specific forks where people implemented Alex Krizhevsky's model-parallel AlexNet variant.
On the other hand "Caffe2 also supports model parallelism, but pretty manually. You can assign each operator to different GPU by using DeviceScope." (caffe2/caffe2#371) This is similar to most modern frameworks, including Chainer.

Darknet:
There's an RNN example on the website. Shouldn't "RNNs" be "full"?

Those are fairly new -- the LSTM was added three weeks ago. I think that means RNNs should be listed as "partial" since the built-in modules only support vanilla RNNs and classic LSTMs, and they aren't very customizable / the user can't easily add their own.
It doesn't seem to have a generic CPU/GPU backend.
You're right; it uses C macros to switch between compiling exclusively for CPU and exclusively for GPU. So that should be "no."
It seems to have a multi-GPU data parallelism mechanism (https://github.com/pjreddie/darknet/blob/master/examples/classifier.c#L102, https://github.com/pjreddie/darknet/blob/master/src/network_kernels.cu#L375).
Yes, it does.
It has a "train_network" function, so can't we say it has a native trainer?
I think what I mean by a native trainer is functionality that means the user doesn't have to write a custom training loop for each new model; darknet's examples all have their own training loops while the train_network function only does an SGD update.
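
The distinction drawn above, between a single update function and a trainer that owns the loop, can be sketched as follows. This is an illustration only: `Trainer` and `sgd_step` are hypothetical names and do not correspond to any framework's actual API. `sgd_step` plays the role of a function that performs just one SGD update, while `Trainer` is the reusable part that a user would otherwise rewrite for every model.

```python
class Trainer:
    """Hypothetical reusable loop: owns iteration over batches, epochs and logging."""

    def __init__(self, update, batches, epochs):
        self.update = update      # one optimisation step: (params, batch) -> (params, loss)
        self.batches = batches
        self.epochs = epochs
        self.log = []             # mean loss per epoch

    def run(self, params):
        for _ in range(self.epochs):
            total = 0.0
            for batch in self.batches:
                params, loss = self.update(params, batch)
                total += loss
            self.log.append(total / len(self.batches))
        return params


def sgd_step(w, batch, lr=0.1):
    """A single SGD update for the model y = w * x with squared-error loss."""
    xs, ys = batch
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return w - lr * grad, loss


batches = [([1.0, 2.0], [2.0, 4.0]), ([3.0], [6.0])]
trainer = Trainer(sgd_step, batches, epochs=20)
w = trainer.run(0.0)
```

Swapping in a different model only requires a different `update` function; the loop, epoch handling, and logging stay the same.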

@jekbradbury

Contributor

jekbradbury commented Jun 28, 2017

I described multiprocessing support in PyTorch as partial because it's very difficult (I don't think anyone's made it work yet) to use the torch multiprocessing module to build synchronous multi-GPU training similar to MultiprocessParallelUpdater. Instead it's mostly been used for asynchronous training (Hogwild) on CPU; users are supposed to wait for Distributed PyTorch (which has been merged to master but not released) if they want multi-process multi-GPU training.

@delta2323

Member

delta2323 commented Jun 28, 2017

On DL4J:

  • It has an RNN tutorial, so I thought its "RNN support" could be "full".
  • I thought at least one of "Reverse-mode autograd" or "Forward-mode autograd" should be Y, because otherwise we could not train models (the same is true of Caffe1/2 and Darknet). Is my understanding correct?
  • Why did you consider DL4J's "cuDNN support" to be partial?
  • Can't we use the usual Java debuggers for runtime debugging?
@jekbradbury

Contributor

jekbradbury commented Jun 28, 2017

I called DL4J's RNN support "partial" because it offers three kinds of RNNs (BaseRecurrent and uni- and bidirectional LSTMs) that are not intended to be modified/customized by the user. In order to add another one, a user would have to implement both the forward and backward passes as raw array operations.
It isn't an autograd-based framework: like darknet and Caffe it's restricted in the ways you can put together network layers -- just a single linear stack, with limited exceptions. You don't need any autograd (i.e., toposort + traversing the graph backwards) to implement these layer-based frameworks and you won't find those capabilities in their code.
I called the cuDNN support partial because it doesn't support cuDNN RNNs.
I will fix the debugging cell; runtime debugging should work.
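
For readers unfamiliar with the "toposort + traversing the graph backwards" description of autograd above, here is a minimal reverse-mode sketch in plain Python. It is not taken from any of the frameworks discussed: `Var`, `toposort`, and `backward` are hypothetical names, and only addition and multiplication are supported.

```python
class Var:
    """A node in a computation graph: value, gradient, and links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents    # tuple of (input Var, local derivative) pairs

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def toposort(root):
    """Return the nodes so that every node appears after all of its inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def backward(root):
    """Walk the graph from output to inputs, accumulating chain-rule products."""
    root.grad = 1.0
    for node in reversed(toposort(root)):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x, y = Var(3.0), Var(4.0)
z = x * y + x      # x is used twice, so this graph is not a linear stack
backward(z)
```

A purely layer-based framework can skip the sort because a linear stack already fixes the order: each layer's hand-written backward pass is simply called in reverse.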

@agibsonccc

agibsonccc commented Jun 28, 2017

@jekbradbury the cudnn support for RNNs has just a bit more work to finish: deeplearning4j/deeplearning4j#3339 (mainly just lack of bandwidth)

As for the autodiff component: deeplearning4j/nd4j#1750 You can find more on that here. I am intending nd4j to be the "chainer/torch" equivalent. DL4j is likely going to stay higher level closer to keras.

As for runtime debugging, yes it's equivalent if not better than python in this department. The JVM actually supports remote debugging via intellij/eclipse etc and your favorite tools.
You're usually using an equivalent of:
-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8050

to do runtime debugging. You have to explicitly expose it on a port though. Then it's just like your local debugger.
There's also runtime profiling.

The linear stack is just plain wrong though. http://deeplearning4j.org/compgraph allows anything you want.

We will be combining this with the autodiff support in there to be flexible just like the other frameworks in this case.

We will also have a "computation graph" in our auto diff as well. This one will be the traditional "computation graph" with just raw math ops defined, with optimizations and the like, just like tf/theano/torch etc.

A "graph" + a "workspace" (http://deeplearning4j.org/workspaces) is the equivalent of a "tensorflow session". This will allow for near gc free workloads (due to buffer reuse) across a graph.

Scrolling up seeing some of the other comparisons, I'll also just briefly touch on multi gpu (most folks never get this right).
For distributed training, we have spark and a parameter server based approach. You can see more on that here:
https://deeplearning4j.org/distributed
http://deeplearning4j.org/spark

For single node training we support parallelwrapper:
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-cuda-specific-examples/src/main/java/org/deeplearning4j/examples/multigpu/MultiGpuLenetMnistExample.java

which is basically a data parallel implementation that supports the same knobs our spark implementation does (it makes some assumptions about single node and the like though). Both of these support any arbitrary neural net config.

@AlexDBlack

AlexDBlack commented Jun 28, 2017

Adding to @agibsonccc's earlier comment

I called DL4J's RNN support "partial" ... not intended to be modified/customized by the user.

Depends on your definitions of "partial" and "modified/customized". :)
You are correct that adding a new RNN layer (or, new unit, such as GRU) requires manual backprop implementations (for now).
The usual customizations (activation functions, weight inits, TBPTT) are all in there.
DL4J also supports stateful RNNs - i.e., users can do partial forward pass based on the next step/s in a sequence; masking functionality/variable length sequences; global pooling over (variable length) time series etc.
Happy to answer questions if you want more info on any of that.
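
As a rough illustration of the "stateful RNN" idea mentioned above (continuing a forward pass from stored state on the next steps of a sequence), here is a minimal sketch. It is not DL4J's API: `StatefulRNN`, `reset`, and `step_through` are hypothetical names, and the cell is a one-unit linear recurrence with no activation function, chosen to keep the arithmetic easy to check.

```python
class StatefulRNN:
    """Toy one-unit RNN that keeps its hidden state between calls."""

    def __init__(self, w_in, w_rec):
        self.w_in = w_in
        self.w_rec = w_rec
        self.hidden = 0.0

    def reset(self):
        """Clear the stored state before starting a new sequence."""
        self.hidden = 0.0

    def step_through(self, inputs):
        """Consume a (possibly partial) sequence, continuing from the stored state."""
        outputs = []
        for x in inputs:
            self.hidden = self.w_rec * self.hidden + self.w_in * x
            outputs.append(self.hidden)
        return outputs


rnn = StatefulRNN(w_in=1.0, w_rec=0.5)
whole = rnn.step_through([1.0, 2.0, 3.0, 4.0])
rnn.reset()
chunked = rnn.step_through([1.0, 2.0]) + rnn.step_through([3.0, 4.0])
```

Feeding the sequence in two chunks gives the same outputs as feeding it whole, because the hidden state carries over between calls until `reset` is invoked.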

@jekbradbury

Contributor

jekbradbury commented Jun 29, 2017

Thanks for the clarifications, Adam and Alex!

@bkvogel

bkvogel approved these changes Jul 3, 2017

The columns for the frameworks I reviewed (Torch7, mxnet, Thinc) LGTM.

@delta2323

Member

delta2323 commented Jul 12, 2017

About Knet

  • Knet relies on KnetArray as GPU backend
  • Knet.conv4 seems to support cuDNN, but I found that most of the cuDNN support is deprecated. So I am wondering whether we should mark cuDNN support as "partial".
@delta2323

Member

delta2323 commented Jul 12, 2017

@jekbradbury Thank you for updating the table. Also thank you @agibsonccc and @AlexDBlack for your invaluable comments.

@delta2323

Member

delta2323 commented Jul 14, 2017

About PaddlePaddle

  • Could you let me know why you think of CNN support as partial?
  • You put "full" for cuDNN support. But it seems PaddlePaddle does not use cuDNN for RNNs. I judged so because I could not find RNN-related code here, and I found that they call kernels directly in the forward propagation of LSTM here. Therefore I think "partial" would be better.
@delta2323

Member

delta2323 commented Jul 14, 2017

For CPU/GPU backend, it would be better to fill something in when a framework implements its own array library but the library is not named (the way CuPy is). For example, "native" or the same name as the framework (as is done in the neon column).

@delta2323

Member

delta2323 commented Jul 14, 2017

About neon

  • Although it has its own CPU tensor class, it seems to be a wrapper of the NumPy ndarray class (see here). So, how about writing "CPU backend package" as "Wrapper of NumPy", or simply "NumPy"?
  • As for the GPU array backend, it depends on PyCUDA for at least the default GPU memory allocation and custom GPU kernels. So, as with CPU, "Wrapper of PyCUDA" or "PyCUDA" could be better for "GPU backend package".
@jekbradbury

Contributor

jekbradbury commented Jul 14, 2017

The reason I listed PaddlePaddle's cuDNN support as "full" is because they wrap everything in cuDNN except the RNN functions, but they implement their own time-fused, cuDNN-like RNN kernels instead (I believe this is because they wrote them before cuDNN RNNs were available). So those are likely to be competitive with cuDNN in performance, which is not the case with most other frameworks' non-cuDNN RNN kernels.

@agibsonccc

agibsonccc commented Jul 14, 2017

Hey folks - just watching this thread here. http://nd4j.org/backend.html Our equivalent for dl4j is a tensor lib called nd4j. CPU and GPU are supported. The basic pitch is "hardware as a jar file" rather than compile/link.

The c++ internals are: https://github.com/deeplearning4j/libnd4j - 1 code base for cpu/gpu (mostly shared business logic for tensor primitives)

@delta2323

Member

delta2323 commented Jul 18, 2017

LGTM for the frameworks I reviewed.

@niboshi

Member

niboshi commented Jul 18, 2017

@jekbradbury
I'm sorry for the delayed reply. Thank you for your comments on my review.
Can you add "MATLAB" to the "Caffe1/Caffe2" / "Language" cell, and put "Y" in the "Darknet" / "Multi-GPU data parallelism" cell, as I wrote?
LGTM otherwise.

@niboshi

Member

niboshi commented Jul 20, 2017

Thank you for the fix!
LGTM for my assignments.

@delta2323

Member

delta2323 commented Jul 20, 2017

jenkins, test this please.

@delta2323

LGTM except one comment

.. [6] Also available in the `Torch RNN package <https://github.com/Element-Research/rnn>`_
.. [7] Via `Platoon <https://github.com/mila-udem/platoon/>`_
.. [8] `Experimental as May 2016 <http://deeplearning.net/software/theano/tutorial/using_multi_gpu.html>`_
This table compares Chainer with other actively developed deep learning frameworks. Content is current as of May 2017.

@delta2323

delta2323 Jul 26, 2017

Member

Please change from May to July

@delta2323

Member

delta2323 commented Jul 26, 2017

test passed

@delta2323 delta2323 merged commit df7f4c8 into chainer:master Jul 27, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@delta2323 delta2323 added this to the v3.0.0b1 milestone Jul 27, 2017
