using CuArrays #417
Conversation
Tests pass on MIT Supercloud as well. @ekinakyurek, can you take this branch (git checkout cuarrays) and do some memory/time benchmarks on your big models, comparing with master? The goal is to optimize the allocation strategy of CuArrays and make it as good as or better than Knet's, if it isn't already. We will test/optimize the kernels in the next step: this branch mostly uses Knet kernels for now.
I am looking into it and will update you.
I tested on Supercloud, which has Nvidia V100 GPUs.
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 163228 C .../gridsan/eakyurek/julia-1.0.1/bin/julia 10374MiB |
| 1 183653 C .../gridsan/eakyurek/julia-1.0.1/bin/julia 8536MiB |
+-----------------------------------------------------------------------------+
knetarrays:
cuarrays:
@maleadt: this shows about a 40% speed and 20% memory regression when switching from KnetArray to CuArrays allocation. Any ideas? @ekinakyurek: did you do any initial warm-up iterations to exclude differences due to compilation? @MikeInnes: does this version have the same early-garbage-collection behavior as Knet?
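On the warm-up question: the first call of a Julia function includes JIT compilation, so a fair benchmark runs the workload once before timing. A minimal, hypothetical harness (not the actual benchmark.jl; `work` is a made-up stand-in workload):

```julia
# Hypothetical warm-up harness: the first call pays JIT compilation cost,
# so run the workload once untimed before collecting measurements.
function timed_runs(f; warmup=1, runs=3)
    for _ in 1:warmup
        f()                         # compile and discard the result
    end
    return [(@elapsed f()) for _ in 1:runs]
end

work() = sum(abs2, randn(10_000))   # stand-in for a forward/backward pass
times = timed_runs(work)            # only post-compilation timings remain
```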
Not sure the memory regression is telling us anything: CuArrays might just be holding on to allocations a little longer (I don't know how the Knet pooling allocator works), or this is due to the recent "large buffer bailout" (keeping large buffers outside of the pool). The runtime regression is obviously pretty serious; this would be a useful test case for taking timings and spotting inefficiencies or tuning the algorithm. Any instructions to reproduce these measurements? Although I won't have much time to spend on this.
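For readers unfamiliar with the "large buffer bailout" mentioned here: small buffers get cached in a size-bucketed pool for reuse, while very large buffers bypass the pool entirely. A rough CPU-side caricature (an assumption-laden sketch with a made-up threshold, not CuArrays' or Knet's actual allocator):

```julia
# Illustrative sketch of a size-bucketed pooling allocator with a "large
# buffer bailout". Sizes are rounded up to a power of two; buffers above
# LARGE_THRESHOLD skip the pool, so they are freed/GC'd instead of cached.
const LARGE_THRESHOLD = 2^20                     # assumption: 1 MiB cutoff
const pool = Dict{Int,Vector{Vector{UInt8}}}()   # bucket size => free buffers

bucket(n) = nextpow(2, max(n, 1))

function pool_alloc(n)
    n > LARGE_THRESHOLD && return Vector{UInt8}(undef, n)  # bailout: no pooling
    b = bucket(n)
    free = get!(pool, b, Vector{UInt8}[])
    return isempty(free) ? Vector{UInt8}(undef, b) : pop!(free)
end

function pool_free(buf)
    length(buf) > LARGE_THRESHOLD && return      # large buffers are dropped
    push!(get!(pool, bucket(length(buf)), Vector{UInt8}[]), buf)
end
```

The trade-off discussed in this thread falls out directly: pooling makes repeated small allocations cheap but inflates peak usage, while the bailout keeps peak usage down at the cost of re-allocating large buffers every iteration.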
I set up the repo to replicate the experiment in the first comment. There is a very short README.md (use option b for the data) to set up the package and download the data. There is also job.sh to run the experiment. The only challenge is that you need ~70GB of RAM to run this: I set it up this way because loading the image data from file made the code very slow, and I couldn't find a fast way to load the data in parallel. I also opened an issue (JuliaIO/HDF5.jl#518) about that a while ago.
Can't we set up a test with artificial data to lower the barrier to entry? A simple script that doesn't load anything but still goes through the motions to give memory/time benchmarks.
Okay, I added benchmark.jl for testing, so you don't need to install the data:
https://github.com/ekinakyurek/Mac-Network-Knet/blob/master/benchmark.jl
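The idea of such a synthetic benchmark is to feed random arrays of a fixed shape through the model, so the allocation pattern is exercised without the ~70GB dataset. A toy CPU-only sketch of that shape (hypothetical; the real benchmark.jl drives the actual MAC network, and `loss`, `batchsize`, and `featdim` here are made up):

```julia
# Toy stand-in for a synthetic benchmark: random inputs replace the dataset,
# so timing measures compute and allocation rather than data loading.
loss(w, x) = sum(abs2, w * x)              # placeholder for the model's loss

function benchmark(; N=100, batchsize=32, featdim=512)
    w = randn(Float32, 10, featdim)        # fake parameters
    total = 0.0f0
    for _ in 1:N
        x = randn(Float32, featdim, batchsize)  # fresh fake batch each step
        total += loss(w, x)
    end
    return total
end
```

Timing it with `@time benchmark(N=100)` (after a warm-up call) then gives numbers comparable across allocator branches, in the spirit of the measurements below.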
Not sure how large your arrays are, but the pooling slowdown likely comes down to us not pooling large arrays. You can tweak that here to test (it'll probably make peak usage worse). If we can confirm that, it's easy enough for me to fix. @maleadt, do you have any thoughts or warnings about using multiple GPUs and BLAS handles, etc.? Right now the tests fail for this reason, e.g. on Cyclops, but it's a little unclear to me what the bad interactions are. Might be worth trying to reproduce that and seeing the error.
I was just training the same network; here is the new timing on the current master of Knet:
Is this for KnetArrays or CuArrays?
I just tried a single forwards/backwards pass of @ekinakyurek's benchmark and found both master and this branch to be around 12s. (This is on the GTX 1080 Ti on Cyclops.) Have I done something dumb here, or does the perf difference only show up with more iterations, or something else?
I believe there is a mistake in your setup, or the GTX 1080 Ti doesn't work well with either branch. Here are the benchmark results on a Tesla V100 machine with Julia 1.1.0.

Knet#master:

julia> @time benchmark(M,o;N=100)
 29.891977 seconds (11.12 M allocations: 5.232 GiB, 7.43% gc time)
julia> @time benchmark(M,o;N=100)
 24.664394 seconds (11.12 M allocations: 5.232 GiB, 6.71% gc time)
julia> @time benchmark(M,o;N=1000)
244.700195 seconds (111.28 M allocations: 52.324 GiB, 6.23% gc time)

Knet#cuarrays:

julia> @time benchmark(M,o;N=100)
 35.008532 seconds (18.46 M allocations: 10.327 GiB, 10.82% gc time)
julia> @time benchmark(M,o;N=100)
 33.461694 seconds (18.45 M allocations: 10.326 GiB, 11.60% gc time)
julia> @time benchmark(M,o;N=1000)
329.063674 seconds (184.61 M allocations: 103.264 GiB, 10.29% gc time)

Note: as I said earlier, these results are not realistic, because the real data has unequal sequence sizes, which is more challenging for memory management.
It was KnetArrays, but there may be a mistake because I was transitioning the code; I believe that's why the result above is better than the previous ones. I will update you.
@denizyuret, the first benchmarks are still the same: one epoch takes ~39 minutes. Sorry for the confusion.
How much memory does this benchmark require? It OOMs on both my 4GB GTX 970 and 6GB Titan...
If I remember correctly it works on a K80 (12GB); we tested on a V100 (16GB). With good memory management it will probably run on 8GB (Ekin can check the PyTorch implementation as a reference). This was the model that forced me to add active GC in the backward pass for KnetArrays.
Got it working on a different system.

CuArrays.jl (cuarrays branch):

julia> @time benchmark(M,o;N=100)
 35.288962 seconds (19.78 M allocations: 10.385 GiB, 4.16% gc time)
julia> @time benchmark(M,o;N=100)
 34.870759 seconds (19.78 M allocations: 10.385 GiB, 4.22% gc time)
julia> @time benchmark(M,o;N=100)
 34.796848 seconds (19.77 M allocations: 10.385 GiB, 4.25% gc time)

Regular Knet (v0.8.2-803-g0a2320f8, closest master branch commit):

julia> @time benchmark(M,o;N=100)
 66.928443 seconds (11.12 M allocations: 5.232 GiB, 50.40% gc time)
julia> @time benchmark(M,o;N=100)
 68.562458 seconds (11.12 M allocations: 5.232 GiB, 51.33% gc time)

So yeah... This is on a GTX 2080 with 11GB RAM, Julia 1.1.

Interesting -- Knet seems to do half the allocations (both in number and total size) but spends 10x more time on GC. Ekin, do you have any results comparing the two branches?
Yes, I shared them below. @maleadt, Knet master is at v1.2.0+; how did you use v0.8.2?
Hmm, now it takes over 100 seconds for both implementations; something seems off.
FWIW, I could reproduce the timings once more on proper branches (
Knet
I've never seen Knet perform better, but CuArrays took over 100s at one point. Not sure where that came from; does the model have any nondeterminism? Also, with CuArrays we can inspect timings in more detail, and GC time seems reasonable:
I've pushed some fixes for the latest CuArrays/CUDAdrv/CUDAnative. Use CUDAdrv/CUDAnative from master, and CuArrays from JuliaGPU/CuArrays.jl#275.
Let me note that an 8GB GPU is a good environment for stress testing, so 12GB GPUs would be better for testing normal conditions (assuming most ML development environments have 12GB). I also ran tests on an Nvidia K80, and the results are again different:

knet#master:

julia> @time benchmark(M,o;N=100)
 65.345801 seconds (11.11 M allocations: 5.232 GiB, 0.58% gc time)

Knet#cuarrays:

julia> @time benchmark(M,o;N=100)
 79.009186 seconds (18.44 M allocations: 10.326 GiB, 1.38% gc time)

Did you test with the latest updates to CuArrays? If so, how can I make my Knet#cuarrays branch use the latest CuArrays?
Okay, I am checking out the latest branches of CuArrays and will try again.
I ran the script provided by @maleadt on an AWS p2.xlarge (a K80 GPU) instance with the AMI image ami-01df61498d474ecd2. Here is the script. Here are the results:

Knet#cuarrays:

 80.332260 seconds (19.77 M allocations: 10.385 GiB, 1.82% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 80.397125 seconds (19.77 M allocations: 10.385 GiB, 1.83% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 80.449783 seconds (19.77 M allocations: 10.385 GiB, 1.84% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

Knet#master:

 68.879578 seconds (11.11 M allocations: 5.232 GiB, 0.96% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.964452 seconds (11.12 M allocations: 5.232 GiB, 0.75% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 69.143434 seconds (11.12 M allocations: 5.232 GiB, 0.76% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
Great, thanks. That matches my most recent measurements, and leaves me stumped as to how I ended up with those (consistently reproducible within that session) measurements in #417 (comment).
Self-reminder that this needs to be updated for JuliaGPU/CuArrays.jl#275.
I already pushed fixes onto this branch for JuliaGPU/CUDAdrv.jl#125 and JuliaGPU/CuArrays.jl#275. I've been doing some optimizations of the CuArrays allocator in JuliaGPU/CuArrays.jl#277, resulting in the following timings on my 2080Ti.
What's curious here is the increase in the number of CPU allocations -- probably because of nesting a non-isbits (i.e. needing allocation) CuArray within a KnetArray? In one of the runs, Knet+CuArrays triggered the fast execution from #417 (comment) again, so I'm also suspecting some nondeterminism somewhere.
On
@ekinakyurek, could you have a run with
The benchmark above now consistently performs faster on my 2080Ti with Knet+CuArrays:
Hi @maleadt, @MikeInnes, @denizyuret, I benchmarked again and the results are still similar to the earlier experiments that I ran on K80 and V100 machines. I think you have somehow overfit the allocation strategy to the 2080Ti, or something else is going on.

AWS p2.xlarge (K80) benchmark. Here is the updated script.

knet#cuarrays:

 78.341572 seconds (18.58 M allocations: 10.364 GiB, 1.42% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 78.549632 seconds (18.58 M allocations: 10.364 GiB, 1.41% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 78.663026 seconds (18.58 M allocations: 10.364 GiB, 1.42% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

knet#master (arrtype = KnetArray{Float32,N} where N):

 68.307643 seconds (11.11 M allocations: 5.232 GiB, 0.75% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.499359 seconds (11.12 M allocations: 5.232 GiB, 0.61% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.678919 seconds (11.12 M allocations: 5.232 GiB, 0.61% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

Note: it would be helpful for everybody if you could explain a bit what you updated in CuArrays after these benchmarks, how it differs from Knet's allocation strategy, and how it differs from the previous version of CuArrays.
There's not much to fit, though; the allocator doesn't have any parameters (except for the idle-time memory freeing, but that isn't relevant here). The only thing I can imagine is whether we first try to
The allocator has been refactored and its behavior is now pretty clear (IMO) from the main state machine's code: https://github.com/JuliaGPU/CuArrays.jl/blob/f5cbcfcf329b7e15a9bbbf0f344c5613a83e7148/src/memory.jl#L240-L330 Furthermore, I added timer outputs, which yield output like this when calling
Adding similar timings to Knet should make it possible to compare behavior, both in terms of how many allocations happen and how much time is spent on them.
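The general shape of such a pooling allocator's state machine is an escalating retry loop: try the raw allocation, and on failure free progressively more (incremental GC, full GC, then dropping all cached buffers) before retrying. A plain-Julia caricature of that escalation (a sketch under assumed names, not CuArrays' actual code; `raw_alloc` returns `nothing` on failure):

```julia
# Caricature of the escalating retry loop in a pooling GPU allocator:
# each step frees more aggressively before retrying the raw allocation.
function alloc_with_retry(raw_alloc, n; gc_hook=GC.gc, drop_pool=() -> nothing)
    steps = (() -> nothing,         # 1. just try the allocation
             () -> gc_hook(false),  # 2. incremental GC, then retry
             () -> gc_hook(true),   # 3. full GC, then retry
             drop_pool)             # 4. release all cached buffers, last retry
    for step in steps
        step()
        buf = raw_alloc(n)
        buf !== nothing && return buf
    end
    error("out of memory allocating $n bytes")
end
```

Instrumenting each step (as the timer outputs above do) shows where time goes: a healthy workload spends nearly all calls in step 1, while heavy GC time means the later steps are being reached often.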
@ekinakyurek, do you have any timings comparing this with the max_pool setting I pointed out earlier? It would be useful to know whether disabling it gives you any speedup on your setup.
@Keno is looking at GPU performance now, so he might have some idea of what the low-hanging fruit is.
@MikeInnes, sorry for the late reply; I am going to look at it today.
Still looking into it. Hopefully I can just pick the low-hanging fruit and things will magically get faster.
@MikeInnes
1.4 has CuArrays fully integrated.
All tests now pass, save for the JLD code, which we need to replace.
Tests fail on multi-GPU setups and segfault at the end due to a stream destructor. I think we need to rework all the handles and runtime stuff to do things the CuArrays way, e.g. through CUDAdrv. I'm not sure what our multi-GPU or stream story is in terms of cuBLAS handles, etc., though. In any case, we mainly want to do benchmarking to begin with, and this should be ready enough for that.