using CuArrays #417
Conversation
Tests pass on MIT Supercloud as well. @ekinakyurek, can you take this branch (git checkout cuarrays) and do some memory/time benchmarks on your big models, comparing with master? The goal is to optimize the allocation strategy of CuArrays and make it as good as or better than Knet's, if it isn't already. We will test/optimize the kernels in the next step: this branch mostly uses Knet kernels for now.
I am looking into it and will update you.
I tested on Supercloud, which has Nvidia V100 GPUs.
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 163228 C .../gridsan/eakyurek/julia-1.0.1/bin/julia 10374MiB |
| 1 183653 C .../gridsan/eakyurek/julia-1.0.1/bin/julia 8536MiB |
+-----------------------------------------------------------------------------+
knetarrays:
cuarrays:
@maleadt: this shows about a 40% speed and 20% memory regression when switching from KnetArray to CuArrays allocation. Any ideas? @ekinakyurek: did you do any initial warm-up iterations to exclude differences due to compilation? @MikeInnes: does this version have the same early-garbage-collection behavior as Knet?
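On the warm-up question: the first call of a Julia function includes JIT compilation, so a fair benchmark runs the workload once before timing. A minimal, hypothetical harness (not the actual benchmark.jl; `work` is a made-up stand-in workload):

```julia
# Hypothetical warm-up harness: the first call pays JIT compilation cost,
# so run the workload once untimed before collecting measurements.
function timed_runs(f; warmup=1, runs=3)
    for _ in 1:warmup
        f()                         # compile and discard the result
    end
    return [(@elapsed f()) for _ in 1:runs]
end

work() = sum(abs2, randn(10_000))   # stand-in for a forward/backward pass
times = timed_runs(work)            # only post-compilation timings remain
```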
Not sure the memory regression is telling us anything: CuArrays might just be holding on to allocations a little longer (I don't know how the Knet pooling allocator works), or this is due to the recent "large buffer bailout" (keeping large buffers outside of the pool). The runtime regression is obviously pretty serious; this would be a useful test case for taking timings and spotting inefficiencies or tuning the algorithm. Any instructions to reproduce these measurements? Although I won't have much time to spend on this.
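For readers unfamiliar with the "large buffer bailout" mentioned here: small buffers get cached in a size-bucketed pool for reuse, while very large buffers bypass the pool entirely. A rough CPU-side caricature (an assumption-laden sketch with a made-up threshold, not CuArrays' or Knet's actual allocator):

```julia
# Illustrative sketch of a size-bucketed pooling allocator with a "large
# buffer bailout". Sizes are rounded up to a power of two; buffers above
# LARGE_THRESHOLD skip the pool, so they are freed/GC'd instead of cached.
const LARGE_THRESHOLD = 2^20                     # assumption: 1 MiB cutoff
const pool = Dict{Int,Vector{Vector{UInt8}}}()   # bucket size => free buffers

bucket(n) = nextpow(2, max(n, 1))

function pool_alloc(n)
    n > LARGE_THRESHOLD && return Vector{UInt8}(undef, n)  # bailout: no pooling
    b = bucket(n)
    free = get!(pool, b, Vector{UInt8}[])
    return isempty(free) ? Vector{UInt8}(undef, b) : pop!(free)
end

function pool_free(buf)
    length(buf) > LARGE_THRESHOLD && return      # large buffers are dropped
    push!(get!(pool, bucket(length(buf)), Vector{UInt8}[]), buf)
end
```

The trade-off discussed in this thread falls out directly: pooling makes repeated small allocations cheap but inflates peak usage, while the bailout keeps peak usage down at the cost of re-allocating large buffers every iteration.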
I set up the repo to replicate the experiment in the first comment. There is a very short README.md (use option b for the data) to set up the package and download the data. There is also job.sh to run the experiment. The only challenge is that you need ~70GB of RAM to run this: I set it up this way because loading the image data from file made the code very slow, and I couldn't find a fast way to load the data in parallel. I also opened an issue (JuliaIO/HDF5.jl#518) about that a while ago.
Can't we set up a test with artificial data to lower the barrier to entry? A simple script that doesn't load anything but still goes through the motions to give memory/time benchmarks.
Okay, I added benchmark.jl for testing, so you don't need to install the data:
https://github.com/ekinakyurek/Mac-Network-Knet/blob/master/benchmark.jl
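The idea of such a synthetic benchmark is to feed random arrays of a fixed shape through the model, so the allocation pattern is exercised without the ~70GB dataset. A toy CPU-only sketch of that shape (hypothetical; the real benchmark.jl drives the actual MAC network, and `loss`, `batchsize`, and `featdim` here are made up):

```julia
# Toy stand-in for a synthetic benchmark: random inputs replace the dataset,
# so timing measures compute and allocation rather than data loading.
loss(w, x) = sum(abs2, w * x)              # placeholder for the model's loss

function benchmark(; N=100, batchsize=32, featdim=512)
    w = randn(Float32, 10, featdim)        # fake parameters
    total = 0.0f0
    for _ in 1:N
        x = randn(Float32, featdim, batchsize)  # fresh fake batch each step
        total += loss(w, x)
    end
    return total
end
```

Timing it with `@time benchmark(N=100)` (after a warm-up call) then gives numbers comparable across allocator branches, in the spirit of the measurements below.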
Not sure how large your arrays are, but the pooling slowdown likely comes down to us not pooling large arrays. You can tweak that here to test (it'll probably make peak usage worse). If we can confirm that, it's easy enough for me to fix. @maleadt, do you have any thoughts or warnings about using multiple GPUs and BLAS handles, etc.? Right now the tests fail for this reason, e.g. on Cyclops, but it's a little unclear to me what the bad interactions are. Might be worth trying to reproduce that and seeing the error.
I was just training the same network; here is the new timing on the current master of Knet:
Is this for KnetArrays or CuArrays?
I just tried a single forwards/backwards pass of @ekinakyurek's benchmark and found both master and this branch to be around 12s. (This is on the GTX 1080 Ti on Cyclops.) Have I done something dumb here, or does the perf difference only show up with more iterations, or something else?
I believe there is a mistake in your setup, or the GTX 1080 Ti doesn't work well with either branch. Here are the benchmark results on a Tesla V100 machine with Julia 1.1.0.

Knet#master:

julia> @time benchmark(M,o;N=100)
 29.891977 seconds (11.12 M allocations: 5.232 GiB, 7.43% gc time)
julia> @time benchmark(M,o;N=100)
 24.664394 seconds (11.12 M allocations: 5.232 GiB, 6.71% gc time)
julia> @time benchmark(M,o;N=1000)
244.700195 seconds (111.28 M allocations: 52.324 GiB, 6.23% gc time)

Knet#cuarrays:

julia> @time benchmark(M,o;N=100)
 35.008532 seconds (18.46 M allocations: 10.327 GiB, 10.82% gc time)
julia> @time benchmark(M,o;N=100)
 33.461694 seconds (18.45 M allocations: 10.326 GiB, 11.60% gc time)
julia> @time benchmark(M,o;N=1000)
329.063674 seconds (184.61 M allocations: 103.264 GiB, 10.29% gc time)

Note: as I said earlier, these results are not realistic, because the real data has unequal sequence sizes, which is more challenging for memory management.
It was KnetArrays, but there may be a mistake because I was transitioning the code; I believe that's why the result above is better than the previous ones. I will update you.
@denizyuret, the first benchmarks are still the same: one epoch takes ~39 minutes. Sorry for the confusion.
How much memory does this benchmark require? It OOMs on both my 4GB GTX 970 and 6GB Titan...
If I remember correctly it works on a K80 (12GB); we tested on a V100 (16GB). With good memory management it will probably run on 8GB (Ekin can check the PyTorch implementation as a reference). This was the model that forced me to add active GC in the backward pass for KnetArrays.
Got it working on a different system.

CuArrays.jl (cuarrays branch):

julia> @time benchmark(M,o;N=100)
 35.288962 seconds (19.78 M allocations: 10.385 GiB, 4.16% gc time)
julia> @time benchmark(M,o;N=100)
 34.870759 seconds (19.78 M allocations: 10.385 GiB, 4.22% gc time)
julia> @time benchmark(M,o;N=100)
 34.796848 seconds (19.77 M allocations: 10.385 GiB, 4.25% gc time)

Regular Knet (v0.8.2-803-g0a2320f8, closest master branch commit):

julia> @time benchmark(M,o;N=100)
 66.928443 seconds (11.12 M allocations: 5.232 GiB, 50.40% gc time)
julia> @time benchmark(M,o;N=100)
 68.562458 seconds (11.12 M allocations: 5.232 GiB, 51.33% gc time)

So yeah... This is on a GTX 2080 with 11GB RAM, Julia 1.1.

Interesting -- Knet seems to do half the allocations (both in number and total size) but spends 10x more time on GC. Ekin, do you have any results comparing the two branches?
Yes, I shared them below. @maleadt, Knet master is at v1.2.0+; how did you use v0.8.2?
Hmm, now it takes over 100 seconds for both implementations; something seems off.
FWIW, I could reproduce the timings once more on proper branches (
Knet
I've never seen Knet perform better, but CuArrays took over 100s at one point. Not sure where that came from; does the model have any nondeterminism? Also, with CuArrays we can inspect timings in more detail, and GC time seems reasonable:
I've pushed some fixes for the latest CuArrays/CUDAdrv/CUDAnative. Use CUDAdrv/CUDAnative from master, and CuArrays from JuliaGPU/CuArrays.jl#275.
Let me note that an 8GB GPU is a good environment for stress testing, so 12GB GPUs would be better for testing normal conditions (assuming most ML development environments have 12GB). I also ran tests on an Nvidia K80, and the results are again different:

knet#master:

julia> @time benchmark(M,o;N=100)
 65.345801 seconds (11.11 M allocations: 5.232 GiB, 0.58% gc time)

Knet#cuarrays:

julia> @time benchmark(M,o;N=100)
 79.009186 seconds (18.44 M allocations: 10.326 GiB, 1.38% gc time)

Did you test with the latest updates to CuArrays? If so, how can I make my Knet#cuarrays branch use the latest CuArrays?
Okay, I am checking out the latest branches of CuArrays and will try again.
I ran the script provided by @maleadt on an AWS p2.xlarge (a K80 GPU) instance with the AMI image ami-01df61498d474ecd2. Here is the script. Here are the results:

Knet#cuarrays:

 80.332260 seconds (19.77 M allocations: 10.385 GiB, 1.82% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 80.397125 seconds (19.77 M allocations: 10.385 GiB, 1.83% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 80.449783 seconds (19.77 M allocations: 10.385 GiB, 1.84% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

Knet#master:

 68.879578 seconds (11.11 M allocations: 5.232 GiB, 0.96% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.964452 seconds (11.12 M allocations: 5.232 GiB, 0.75% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 69.143434 seconds (11.12 M allocations: 5.232 GiB, 0.76% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
Great, thanks. That matches my most recent measurements, and leaves me stumped as to how I ended up with those (consistently reproducible within that session) measurements in #417 (comment).
Self-reminder that this needs to be updated for JuliaGPU/CuArrays.jl#275.
I already pushed fixes onto this branch for JuliaGPU/CUDAdrv.jl#125 and JuliaGPU/CuArrays.jl#275. I've been doing some optimizations of the CuArrays allocator in JuliaGPU/CuArrays.jl#277, resulting in the following timings on my 2080Ti.
What's curious here is the increase in the number of CPU allocations -- probably because of nesting a non-isbits (i.e. needing allocation) CuArray within a KnetArray? In one of the runs, Knet+CuArrays triggered the fast execution from #417 (comment) again, so I'm also suspecting some nondeterminism somewhere.
On
@ekinakyurek, could you have a run with
The benchmark above now consistently performs faster on my 2080Ti with Knet+CuArrays:
Hi @maleadt, @MikeInnes, @denizyuret, I benchmarked again and the results are still similar to the earlier experiments that I ran on K80 and V100 machines. I think you have somehow overfit the allocation strategy to the 2080Ti, or something else is going on.

AWS p2.xlarge (K80) benchmark. Here is the updated script.

knet#cuarrays:

 78.341572 seconds (18.58 M allocations: 10.364 GiB, 1.42% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 78.549632 seconds (18.58 M allocations: 10.364 GiB, 1.41% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 78.663026 seconds (18.58 M allocations: 10.364 GiB, 1.42% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

knet#master (arrtype = KnetArray{Float32,N} where N):

 68.307643 seconds (11.11 M allocations: 5.232 GiB, 0.75% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.499359 seconds (11.12 M allocations: 5.232 GiB, 0.61% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.678919 seconds (11.12 M allocations: 5.232 GiB, 0.61% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

Note: it would be helpful for everybody if you could explain a bit what you updated in CuArrays after these benchmarks, how it differs from Knet's allocation strategy, and how it differs from the previous version of CuArrays.
There's not much to fit, though; the allocator doesn't have any parameters (except for the idle-time memory freeing, but that isn't relevant here). The only thing I can imagine is whether we first try to
The allocator has been refactored and its behavior is now pretty clear (IMO) from the main state machine's code: https://github.com/JuliaGPU/CuArrays.jl/blob/f5cbcfcf329b7e15a9bbbf0f344c5613a83e7148/src/memory.jl#L240-L330 Furthermore, I added timer outputs, which yield output like this when calling
Adding similar timings to Knet should make it possible to compare behavior, both in terms of how many allocations happen and how much time is spent on them.
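The general shape of such a pooling allocator's state machine is an escalating retry loop: try the raw allocation, and on failure free progressively more (incremental GC, full GC, then dropping all cached buffers) before retrying. A plain-Julia caricature of that escalation (a sketch under assumed names, not CuArrays' actual code; `raw_alloc` returns `nothing` on failure):

```julia
# Caricature of the escalating retry loop in a pooling GPU allocator:
# each step frees more aggressively before retrying the raw allocation.
function alloc_with_retry(raw_alloc, n; gc_hook=GC.gc, drop_pool=() -> nothing)
    steps = (() -> nothing,         # 1. just try the allocation
             () -> gc_hook(false),  # 2. incremental GC, then retry
             () -> gc_hook(true),   # 3. full GC, then retry
             drop_pool)             # 4. release all cached buffers, last retry
    for step in steps
        step()
        buf = raw_alloc(n)
        buf !== nothing && return buf
    end
    error("out of memory allocating $n bytes")
end
```

Instrumenting each step (as the timer outputs above do) shows where time goes: a healthy workload spends nearly all calls in step 1, while heavy GC time means the later steps are being reached often.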
@ekinakyurek, do you have any timings comparing this with the max_pool setting I pointed out earlier? It would be useful to know whether disabling it gives you any speedup on your setup.
@Keno is looking at GPU performance now, so he might have some idea of what the low-hanging fruit is.
@MikeInnes, sorry for the late reply; I am going to look at it today.
Still looking into it. Hopefully I can just pick the low-hanging fruit and things will magically get faster.
@MikeInnes
1.4 has CuArrays fully integrated.
All tests now pass, save for the JLD code, which we need to replace.
Tests fail on multi-GPU setups and segfault at the end due to a stream destructor. I think we need to rework all the handles and runtime stuff to do things the CuArrays way, e.g. through CUDAdrv. I'm not sure what our multi-GPU or stream story is in terms of cuBLAS handles, etc., though. In any case, we mainly want to do benchmarking to begin with, and this should be ready enough for that.