
Update Comparison with Other Frameworks #2717

Merged
merged 11 commits into chainer:master on Jul 27, 2017

Conversation

@jekbradbury (Contributor)

As promised in #2685 I have updated the framework comparison table with (almost?) every actively developed deep learning framework and several new axes of comparison. Let me know if anything seems inaccurate or irrelevant.

@delta2323 (Member) commented May 9, 2017

The errors that occurred in Travis CI are now fixed in the master branch (#2716), so the problem will be solved if you merge the latest master.

@delta2323 (Member) commented May 26, 2017

I am sorry to make you wait. We are now working on tasks related to v2, which will be released on May 30th, and cannot take time to review this yet. Is it OK to review it after the v2 release?

@delta2323 self-assigned this May 26, 2017
@jekbradbury (Contributor, Author)

No problem. Looking forward to the v2 release!

@niboshi added the cat:document label (Documentation such as function documentation, comments and tutorials) on Jun 2, 2017
@delta2323 (Member)

Sorry for making you wait. Now I can work on this PR.

@bkvogel (Contributor) commented Jun 19, 2017

I was wondering if we might want to include another row in the table for acceleration using anything other than cuDNN (such as OpenCL and/or other hardware platform support), but I am not aware of any frameworks with actively-developed support.

@delta2323 (Member) commented Jun 19, 2017

Review status

| framework | assignee | status |
| --- | --- | --- |
| Chainer | @delta2323 | done |
| PyTorch | @delta2323 | done |
| TensorFlow | @mitmul | done |
| Theano-based | @niboshi | done |
| Caffe1/2 | @niboshi | done |
| Torch7 | @bkvogel | done |
| MXNet | @bkvogel | done |
| DyNet | @mitmul | done |
| PaddlePaddle | @delta2323 | done |
| DL4J | @delta2323 | done |
| CNTK | @mitmul | done |
| neon | @delta2323 | done |
| Knet.jl | @delta2323 | done |
| Darknet | @niboshi | done |
| Thinc | @bkvogel | done |

@delta2323 (Member)

@jekbradbury Could you let us know if you had criteria in mind for which frameworks to pick for the table? (I should have discussed the matter before assigning reviewers and starting to review each column.)

@jekbradbury (Contributor, Author) commented Jun 19, 2017

I tried to include every actively developed full-featured deep learning framework that isn't just a wrapper or frontend around another framework. All of them are open source and either used fairly widely by researchers or supported by a major company, with the exception of Darknet, which is prominent in the area of visual object detection and useful if you want to use pure C.

BTW I'd describe non-CUDA GPU support as:

  • Chainer, Theano, DyNet, PaddlePaddle, DL4J, CNTK, Knet, Darknet, Thinc: CUDA-only for the foreseeable future
  • TensorFlow: Codeplay has a port to ComputeCpp, their proprietary SYCL-based OpenCL frontend; some subset of that has been merged into core
  • Torch7, TensorFlow, eventually PyTorch: Hugh Perkins has Coriander-based ports to OpenCL 1.2, but Coriander has performance drawbacks
  • PyTorch, Caffe1/2, Torch7, MXNet, eventually TensorFlow: AMD has in-progress HIP ports (this is not OpenCL; it's a CUDA-like framework that can cross-compile to both NVidia CUDA and AMD's ROCm API); these use MIOpen, AMD's upcoming cuDNN clone, and are likely to be performance-competitive, but they rely on AMD developers doing significant porting work and then either maintaining forks or getting their PRs accepted.

Also, of course, TensorFlow supports Google's in-house TPU hardware through a closed-source XLA compiler backend.

@delta2323 (Member) commented Jun 21, 2017

Thank you for your comment on the choice of DL frameworks. It seems reasonable to me.

For the support of non-CUDA GPUs, I am not sure to what extent we should mention unofficial forked repositories: we can expect that there are many forks that widen non-CUDA GPU support, and it could be difficult to decide where to draw the line. Do you have any idea?

@delta2323 (Member) commented Jun 21, 2017

Could you let us know what the difference is between "full" and "partial" in the CNNs/RNNs rows?

@jekbradbury (Contributor, Author)

It's not particularly precise, but essentially I wanted to distinguish between frameworks that aim to support all major variants/uses of CNNs/RNNs (it may not be easy to write them in the framework, but it is at least possible) and frameworks that have more limited support. For example, Caffe was not designed with NLP in mind and their RNN support is not flexible or customizable; DyNet is an NLP-focused framework that recently added basic convolution and pooling layers but wouldn't be a good choice if you want to write a complex computer vision model. A few frameworks don't intend to support certain use cases at all (e.g. Thinc is only for NLP and Darknet is only for computer vision).

@jekbradbury (Contributor, Author)

Also, I don't think we need to mention non-CUDA at least until AMD officially announces their ports (which will be the first performance-competitive, well-supported deep learning frameworks for non-NVidia GPUs) -- right now they're still in progress on GitHub.

@bkvogel (Contributor) commented Jun 23, 2017

I checked the table for MXNet, Torch7, and Thinc and think it looks fine. I did not see any inaccurate information.
I also agree that there is no need to mention non-CUDA for now, since the AMD ports are apparently still in progress.

@mitmul self-assigned this Jun 26, 2017
@niboshi self-assigned this Jun 26, 2017
@mitmul (Member) commented Jun 26, 2017

Thank you for the PR! Could you tell me what "Per-batch architectures" means?

@jekbradbury (Contributor, Author)

That row was in the original version of the comparison table. It means that the framework is capable of building a totally different network structure for each batch; that's essentially the same thing as define-by-run but it emphasizes what you can do with it.
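
To make that concrete, here is a minimal Chainer-style sketch (the layer names and the `depth` argument are illustrative placeholders, not anything from the table): because the forward pass is ordinary Python code executed for every batch, the graph structure can depend on the data itself.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class PerBatchNet(chainer.Chain):
    def __init__(self):
        super(PerBatchNet, self).__init__()
        with self.init_scope():
            self.embed = L.Linear(None, 128)   # input size inferred lazily
            self.hidden = L.Linear(128, 128)
            self.out = L.Linear(128, 10)

    def __call__(self, x, depth):
        # `depth` can differ from batch to batch, so each batch builds a
        # different computational graph (define-by-run / per-batch architecture).
        h = F.relu(self.embed(x))
        for _ in range(depth):
            h = F.relu(self.hidden(h))
        return self.out(h)
```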

@mitmul (Member) commented Jun 26, 2017

Oh, sorry, I didn't notice that. OK, now I understand it! Thank you for the kind explanation :)

@niboshi (Member) commented Jun 27, 2017

Hi, these are my comments for my assignment. Please point it out if anything is incorrect.

General:

  • What are the differences among "Multi-GPU ~ parallelism", "Multiprocessing", and "Distributed training"?
  • In "CPU/GPU backend", "custom" could be misunderstood as meaning that "users can use their own custom backend". How about writing "native" instead?
  • It's better to put a link to the source where available.

Theano-based:

  • In "Higher-order grads", it seems to only support the Hessian (http://deeplearning.net/software/theano/tutorial/gradients.html). How about writing it as "Only Hessian"?
  • Multi-GPU data parallelism seems to be experimental (https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs). How about writing it as "Experimental"?
  • It seems to have a native trainer. (https://github.com/kirk86/theano/blob/master/trainer.py)

Caffe1/2:

  • It also has a MATLAB binding (only Caffe1, though). (http://caffe.berkeleyvision.org/tutorial/interfaces.html)
  • I couldn't find a source about Multi-GPU model parallelism.

Darknet:

  • There's an RNN example on the web site. Shouldn't "RNNs" be "full"?
  • It doesn't seem to have a CPU/GPU generic backend.
  • It seems to have a Multi-GPU data parallelism mechanism (https://github.com/pjreddie/darknet/blob/master/examples/classifier.c#L102, https://github.com/pjreddie/darknet/blob/master/src/network_kernels.cu#L375).
  • It has a "train_network" function, so can't we say it has a native trainer?

@delta2323 (Member)

> What are the differences among "Multi-GPU ~ parallelism", "Multiprocessing", and "Distributed training"?

I came up with the same question when I was checking PyTorch. Also, why do you think PyTorch's Multiprocessing support is partial?

@jekbradbury (Contributor, Author)

Thanks for the detailed feedback, and for catching a bunch of mistakes! I'll fix the cells I was wrong about soon. Here are some clarifications:

> What are the differences among "Multi-GPU ~ parallelism", "Multiprocessing", and "Distributed training"?

  • Multi-GPU model parallelism: on one machine, placing different parts of a model on different GPUs
  • Multi-GPU data parallelism: on one machine, replicating the model across GPUs with synchronous data-parallel training (e.g. ParallelUpdater)
  • Distributed training: training a single model across multiple machines (e.g. ChainerMN)
  • Multiprocessing: training across multiple OS processes on the same machine (e.g. MultiprocessParallelUpdater) -- this is important for frameworks that run lots of Python code at runtime, because Python can only use multiple CPU cores with multiprocessing (see the sketch below)
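
A rough sketch of how the two single-machine data-parallel options look in Chainer (the iterator and optimizer variables are placeholders, and exact module paths may vary between Chainer versions):

```python
from chainer import training

# Single-process, multi-GPU data parallelism: the model is replicated on
# GPUs 0 and 1 and gradients are aggregated synchronously each iteration.
updater = training.ParallelUpdater(
    train_iter, optimizer, devices={'main': 0, 'second': 1})

# Multiprocessing variant: one worker process per GPU (one iterator each),
# which sidesteps the Python GIL for the per-iteration Python code.
updater = training.updaters.MultiprocessParallelUpdater(
    [train_iter_0, train_iter_1], optimizer, devices=[0, 1])

trainer = training.Trainer(updater, (10, 'epoch'))
trainer.run()
```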

In "CPU/GPU backend", "custom" could be misunderstood as that "users can use their own custom backend". How about writing as "native" instead?
"Native" works, but the point I was trying to get across is that those frameworks built their own array types from scratch but don't expose them with separate APIs, meaning that the array backend is less modular/extensible and can't be used on its own.
> It's better to put a link to the source where available.

Yes, I'll do that.

Theano-based:

> In "Higher-order grads", it seems to only support the Hessian (http://deeplearning.net/software/theano/tutorial/gradients.html). How about writing it as "Only Hessian"?

No, Theano supports arbitrarily nesting theano.grad calls (I've used 4+ levels, although it gets very slow). Higher-order grads are most useful for calculating the Hessian, though, so Theano also offers a convenience function for that use case.
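
For example, a minimal sketch of nested grad calls in Theano:

```python
import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 4
g1 = theano.grad(y, x)   # 4 * x**3
g2 = theano.grad(g1, x)  # 12 * x**2 (second-order)
g3 = theano.grad(g2, x)  # 24 * x    (third-order)
f = theano.function([x], [g1, g2, g3])
print(f(2.0))            # -> 32.0, 48.0, 48.0 at x = 2.0
```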
"Multi-GPU dataparallelism seems to be experimental" (https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs). How about writing as "Experimental"?
Theano has a very slow development process now, so even things that are several years old and work pretty well are described as "new" and "experimental." Not sure what the best way to describe it in the table here is -- I definitely wouldn't want to use Theano for multi-GPU projects, but that's because I think it would be unnecessarily complicated, not because it would be broken.
It seems to have native trainer. (https://github.com/kirk86/theano/blob/master/trainer.py)
That's a fairly old module in someone's stale branch; Theano currently (by design) leaves things like trainers/iterators/datasets to wrapper packages including Blocks, Lasagne, and Keras.

Caffe1/2:

> It also has a MATLAB binding (only Caffe1, though). (http://caffe.berkeleyvision.org/tutorial/interfaces.html)

That's true, thanks.

> I couldn't find a source about Multi-GPU model parallelism.

It looks like Caffe1 never did implement model parallelism, except in some specific forks where people implemented Alex Krizhevsky's model-parallel AlexNet variant. On the other hand, "Caffe2 also supports model parallelism, but pretty manually. You can assign each operator to different GPU by using DeviceScope." (facebookarchive/caffe2#371) This is similar to most modern frameworks, including Chainer.

Darknet:

> There's an RNN example on the web site. Shouldn't "RNNs" be "full"?

Those are fairly new -- the LSTM was added three weeks ago. I think that means RNNs should be listed as "partial", since the built-in modules only support vanilla RNNs and classic LSTMs, they aren't very customizable, and the user can't easily add their own.

> It doesn't seem to have a CPU/GPU generic backend.

You're right; it uses C macros to switch between compiling exclusively for CPU and exclusively for GPU. So that should be "no."

> It seems to have a Multi-GPU data parallelism mechanism (https://github.com/pjreddie/darknet/blob/master/examples/classifier.c#L102, https://github.com/pjreddie/darknet/blob/master/src/network_kernels.cu#L375).

Yes, it does.

> It has a "train_network" function, so can't we say it has a native trainer?

What I mean by a native trainer is functionality that lets the user avoid writing a custom training loop for each new model; Darknet's examples all have their own training loops, while the train_network function only performs an SGD update.
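
For contrast, this is roughly what "having a native trainer" looks like in Chainer; the dataset and network below are placeholders, and the point is just that no per-model training loop has to be written:

```python
import chainer
import chainer.links as L
from chainer import training
from chainer.training import extensions

model = L.Classifier(MyNetwork())  # MyNetwork is a placeholder chain
optimizer = chainer.optimizers.SGD()
optimizer.setup(model)

train_iter = chainer.iterators.SerialIterator(train_dataset, batch_size=32)
updater = training.StandardUpdater(train_iter, optimizer)
trainer = training.Trainer(updater, (10, 'epoch'), out='result')
trainer.extend(extensions.LogReport())
trainer.run()  # the generic loop: forward, backward, update, report
```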

@jekbradbury (Contributor, Author)

I described multiprocessing support in PyTorch as partial because it's very difficult (I don't think anyone's made it work yet) to use the torch multiprocessing module to build synchronous multi-GPU training similar to MultiprocessParallelUpdater. Instead it's mostly been used for asynchronous training (Hogwild) on CPU; users are supposed to wait for Distributed PyTorch (which has been merged to master but not released) if they want multi-process multi-GPU training.
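
For reference, the multiprocessing pattern that does work in PyTorch today is CPU Hogwild, roughly like this (a sketch; `build_model` and the body of `train` are hypothetical user-defined pieces):

```python
import torch.multiprocessing as mp

def train(model):
    # each worker runs its own training loop; parameter updates land in the
    # shared-memory tensors asynchronously (Hogwild-style)
    ...

if __name__ == '__main__':
    model = build_model()   # hypothetical model constructor
    model.share_memory()    # move parameters/buffers into shared memory
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```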

@delta2323 (Member)

On DL4J:

  • It has an RNN tutorial. I thought its "RNN support" could be "full".
  • I thought at least one of "Reverse-mode autograd" or "Forward-mode autograd" should basically be Y, because otherwise we could not train models (the same is true of Caffe1/2 and Darknet). Is my understanding correct?
  • Why did you consider the "cuDNN support" of DL4J to be partial?
  • Can't we use the usual Java debuggers for runtime debugging?

@jekbradbury (Contributor, Author)

I called DL4J's RNN support "partial" because it offers three kinds of RNNs (BaseRecurrent and uni- and bidirectional LSTMs) that are not intended to be modified/customized by the user. In order to add another one, a user would have to implement both the forward and backward passes as raw array operations.
It isn't an autograd-based framework: like darknet and Caffe it's restricted in the ways you can put together network layers -- just a single linear stack, with limited exceptions. You don't need any autograd (i.e., toposort + traversing the graph backwards) to implement these layer-based frameworks and you won't find those capabilities in their code.
I called the cuDNN support partial because it doesn't support cuDNN RNNs.
I will fix the debugging cell; runtime debugging should work.

@agibsonccc commented Jun 28, 2017

@jekbradbury the cudnn support for RNNs has just a bit more work to finish: deeplearning4j/deeplearning4j#3339 (mainly just lack of bandwidth)

As for the autodiff component: https://github.com/deeplearning4j/nd4j/pull/1750 -- you can find more on that there. I am intending ND4J to be the "Chainer/Torch" equivalent. DL4J is likely going to stay higher level, closer to Keras.

As for runtime debugging, yes, it's equivalent to if not better than Python in this department. The JVM actually supports remote debugging via IntelliJ/Eclipse etc. and your favorite tools.
You're usually using an equivalent of:
-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8050

to do runtime debugging. You have to explicitly expose it on a port, though. Then it's just like your local debugger.
There's also runtime profiling.

The linear stack is just plain wrong though. http://deeplearning4j.org/compgraph allows anything you want.

We will be combining this with the autodiff support in there to be flexible just like the other frameworks in this case.

We will also have a "computation graph" in our autodiff. This one will be the traditional computation graph, with just raw math ops defined, plus optimizations and the like, just like TF/Theano/Torch etc.

A "graph" + a "workspace" (http://deeplearning4j.org/workspaces) is the equivalent of a "TensorFlow session". This will allow for near GC-free workloads (due to buffer reuse) across a graph.

Scrolling up and seeing some of the other comparisons, I'll also just briefly touch on multi-GPU (most folks never get this right).
For distributed training, we have spark and a parameter server based approach. You can see more on that here:
https://deeplearning4j.org/distributed
http://deeplearning4j.org/spark

For single-node training we support ParallelWrapper:
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-cuda-specific-examples/src/main/java/org/deeplearning4j/examples/multigpu/MultiGpuLenetMnistExample.java

which is basically a data-parallel implementation that supports the same knobs our Spark implementation does (it makes some assumptions about single-node use and the like, though). Both of these support any arbitrary neural net config.

@AlexDBlack

Adding to @agibsonccc's earlier comment

> I called DL4J's RNN support "partial" ... not intended to be modified/customized by the user.

Depends on your definitions of "partial" and "modified/customized". :)
You are correct that adding a new RNN layer (or a new unit, such as GRU) requires manual backprop implementations (for now).
The usual customizations (activation functions, weight inits, TBPTT) are all in there.
DL4J also supports stateful RNNs (i.e., users can do a partial forward pass based on the next step(s) in a sequence), masking functionality / variable-length sequences, global pooling over (variable-length) time series, etc.
Happy to answer questions if you want more info on any of that.

@jekbradbury (Contributor, Author)

Thanks for the clarifications, Adam and Alex!

@bkvogel (Contributor) left a comment

The columns for the frameworks I reviewed (Torch7, mxnet, Thinc) LGTM.

@delta2323 (Member)

About Knet

  • Knet relies on KnetArray as its GPU backend
  • Knet.conv4 seems to support cuDNN, but I found most of the cuDNN support is deprecated, so I am wondering whether we should mark cuDNN support as "partial".

@delta2323 (Member)

@jekbradbury Thank you for updating the table. Also thank you @agibsonccc and @AlexDBlack for your invaluable comments.

@delta2323 (Member)

About PaddlePaddle

  • Could you let me know why you consider its CNN support partial?
  • You put "full" for cuDNN support, but it seems PaddlePaddle does not use cuDNN for RNNs: I could not find RNN-related cuDNN calls, and the forward propagation of LSTM calls their own kernels directly. Therefore I think "partial" may be better.

@delta2323 (Member) commented Jul 14, 2017

For the CPU/GPU backend row, it could be better to fill in something when a framework implements its own array library but the library does not have a separate name (unlike CuPy) -- for example, "native" or the same name as the framework (as is done in the neon column).

@delta2323 (Member)

About neon

  • Although it has its own CPU tensor class, it seems to be a wrapper of the NumPy ndarray class (see here). So, how about writing the "CPU backend package" as "Wrapper of NumPy", or simply "NumPy"?
  • For the GPU array backend, it depends on PyCUDA at least for default GPU memory allocation and custom GPU kernels. So, as with the CPU, "Wrapper of PyCUDA" or "PyCUDA" could be better as the "GPU backend package".

@jekbradbury (Contributor, Author)

I listed PaddlePaddle's cuDNN support as "full" because they wrap everything in cuDNN except the RNN functions, where they implement their own time-fused, cuDNN-like RNN kernels instead (I believe this is because they wrote them before cuDNN RNNs were available). So those are likely to be competitive with cuDNN in performance, which is not the case with most other frameworks' non-cuDNN RNN kernels.

@agibsonccc

Hey folks -- just watching this thread here. http://nd4j.org/backend.html Our equivalent for DL4J is a tensor lib called ND4J. CPU and GPU are supported. The basic pitch is "hardware as a JAR file" rather than compile/link.

The C++ internals are https://github.com/deeplearning4j/libnd4j -- one code base for CPU/GPU (mostly shared business logic for tensor primitives).

@delta2323 (Member)

LGTM for the frameworks I reviewed.

@niboshi (Member) commented Jul 18, 2017

@jekbradbury
I'm sorry for the delayed reply. Thank you for your responses to my review comments.
Can you add "MATLAB" to the "Caffe1/Caffe2" / "Language" cell, and put "Y" in the "Darknet" / "Multi-GPU data parallelism" cell, as I wrote?
LGTM otherwise.

@niboshi (Member) commented Jul 20, 2017

Thank you for the fix!
LGTM for my assignments.

@delta2323 (Member)

jenkins, test this please.

@delta2323 (Member) left a comment

LGTM except one comment

.. [6] Also available in the `Torch RNN package <https://github.com/Element-Research/rnn>`_
.. [7] Via `Platoon <https://github.com/mila-udem/platoon/>`_
.. [8] `Experimental as May 2016 <http://deeplearning.net/software/theano/tutorial/using_multi_gpu.html>`_
This table compares Chainer with other actively developed deep learning frameworks. Content is current as of May 2017.

Please change from May to July

@delta2323 (Member)

test passed

@delta2323 merged commit df7f4c8 into chainer:master Jul 27, 2017
@delta2323 added this to the v3.0.0b1 milestone Jul 27, 2017