
using CuArrays #417

Closed
wants to merge 9 commits into from

Conversation

@MikeInnes (Collaborator) commented Dec 14, 2018

All tests now pass, save for the JLD code which we need to replace.

Tests fail on multi-GPU setups, and segfault at the end due to a stream destructor. I think we need to rework all the handles and runtime stuff to do things the CuArrays way, e.g. through CUDAdrv. I'm not sure what our multi-GPU or stream story is in terms of cuBLAS handles etc., though.

In any case we mainly want to do benchmarking to begin with, and this should be ready enough for that.

@MikeInnes mentioned this pull request Dec 14, 2018
@denizyuret (Owner)

Tests pass on MIT Supercloud as well. @ekinakyurek, can you take this branch (git checkout cuarrays) and do some memory/time benchmarks on your big models, comparing with master? The goal is to optimize the allocation strategy of CuArrays and make it as good as or better than Knet's, if it isn't already. We will test/optimize the kernels in the next step: this branch mostly uses Knet kernels for now.

@ekinakyurek (Collaborator)

I am looking into it and will update you.

@ekinakyurek (Collaborator)

I tested on Supercloud, which has Nvidia V100 GPUs.

The model that I tested can be found here. It has 2 convolutional layers, 1 bi-LSTM layer, and 12 recurrent cells with many dense/linear layers that attend to the outputs of the convolutional and bi-LSTM layers; there are elementwise-multiplication/vcat/hcat/permutedims operations as well.

  1. Total GPU memory usage after 1 epoch can be found below (0->cuarrays, 1->knetarrays):
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    163228      C   .../gridsan/eakyurek/julia-1.0.1/bin/julia 10374MiB |
|    1    183653      C   .../gridsan/eakyurek/julia-1.0.1/bin/julia  8536MiB |
+-----------------------------------------------------------------------------+
  2. ProgressMeter results after 1 epoch:

knetarrays:

100.00%┣██████████████████████████████████████████████████████┫ 10938/10938 [39:18<00:00 , 4.64 it/s]

cuarrays

100.00%┣██████████████████████████████████████████████████████┫ 10938/10938 [56:23<00:00 , 3.23 it/s]

@denizyuret (Owner)

@maleadt: this shows about a 40% speed and 20% memory regression when switching from KnetArray to CuArrays allocation. Any ideas?

@ekinakyurek: did you do any initial warm-up iterations to exclude the difference due to compilation?

@MikeInnes: does this version have the same behavior as Knet in early garbage collection?
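As a general illustration of the warm-up question above: in Julia the first call to a function pays JIT compilation cost, so it is usually excluded before timing. A minimal CPU-only sketch (`work` is a stand-in for a real forward/backward pass, not the actual benchmark script):

```julia
# Warm-up pattern: the first call compiles, later calls measure runtime only.
# `work` is a hypothetical stand-in for a model's forward/backward pass.
work(x) = sum(abs2, x)

x = rand(Float32, 10^6)
work(x)                  # warm-up call: triggers compilation
t = @elapsed work(x)     # now measures runtime only
println("runtime: ", t, " s")
```

The same pattern applies unchanged to GPU code; without the warm-up call, compilation time inflates the first measured iteration.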

@maleadt (Collaborator) commented Dec 15, 2018

40% speed and 20% memory regression

Not sure the memory regression is telling us anything; CuArrays might just be holding on to allocations a little longer (I don't know how the Knet pooling allocator works), or this is due to the recent "large buffer bailout" (keeping large buffers outside of the pool). The runtime regression is obviously pretty serious, though; this would be a useful test case to take some timings with and spot inefficiencies or tune the algorithm. Any instructions to reproduce these measurements? Although I won't have much time to spend on this.

@ekinakyurek (Collaborator) commented Dec 15, 2018

I pointed to the repo to replicate the experiment in my first comment. There is a very short README.md (use option b for the data) to set up the package and download the data. There is also a job.sh to run the experiment.

However, the only challenge is that you need ~70GB of RAM to run this. Loading the image data from file made the code very slow, so I set it up this way; I couldn't find a fast way to load the data in parallel. I also opened an issue (JuliaIO/HDF5.jl#518) about that a while ago.

@denizyuret (Owner) commented Dec 15, 2018 via email

@ekinakyurek (Collaborator)

Okay, I added benchmark.jl for testing, so you don't need to install the data.

  • The benchmark is not exact, since the RNN sequence sizes are the same within a batch.

https://github.com/ekinakyurek/Mac-Network-Knet/blob/master/benchmark.jl

@MikeInnes (Collaborator, Author)

Not sure how large your arrays are, but the pooling slowdown is likely down to us not pooling large arrays. You can tweak that here to test that (it'll probably make peak usage worse). If we can confirm that then it's easy enough for me to fix it.

@maleadt do you have any thoughts or warnings about using multiple GPUs and blas handles etc? Right now the tests fail e.g. on cyclops for this reason, but it's a little unclear to me what the bad interactions are. Might be worth trying to reproduce that and see the error.
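For illustration, "not pooling large arrays" means buffers above a size cutoff bypass the pool and are allocated and freed directly. A minimal CPU-only sketch of that policy; all names here (`cap_alloc`, `cap_free!`, `POOL_CUTOFF`) are illustrative, not the actual CuArrays API:

```julia
# Size-capped pooling sketch: small buffers are cached and reused,
# large buffers bypass the pool entirely (the "large buffer bailout").
const POOL_CUTOFF = Ref(100 * 1024^2)                 # bytes; the tweakable knob
const small_pool = Dict{Int,Vector{Vector{UInt8}}}()  # size => cached buffers

function cap_alloc(bytes)
    bytes > POOL_CUTOFF[] && return Vector{UInt8}(undef, bytes)  # bypass pool
    bucket = get!(small_pool, bytes, Vector{UInt8}[])
    isempty(bucket) ? Vector{UInt8}(undef, bytes) : pop!(bucket) # reuse if cached
end

function cap_free!(buf)
    bytes = length(buf)
    bytes > POOL_CUTOFF[] && return nothing           # large buffers are dropped
    push!(get!(small_pool, bytes, Vector{UInt8}[]), buf)
    nothing
end
```

Raising the cutoff makes more allocations hit the fast pooled path at the cost of holding on to more memory, which is why tweaking it would probably make peak usage worse.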

@ekinakyurek (Collaborator)

I was just training the same network and new timing on current master of Knet:
100.00%┣███████████████████████████┫ 10938/10938 [32:17/32:17, 5.65i/s]

@denizyuret (Owner)

I was just training the same network and new timing on current master of Knet:
100.00%┣███████████████████████████┫ 10938/10938 [32:17/32:17, 5.65i/s]

Is this for KnetArrays or CuArrays?

@MikeInnes (Collaborator, Author) commented Feb 5, 2019

I just tried with a single forwards/backwards pass of @ekinakyurek's benchmark and found both master and this branch to be around 12s. (This is on the GTX 1080 Ti on Cyclops, and straight @time as opposed to anything fancy like BenchmarkTools.)

Have I done something dumb here, or does the perf difference only show up with more iterations, or something else?

@ekinakyurek (Collaborator)

I believe there is a mistake in your setup, or the GTX 1080 Ti doesn't work well with either branch.

Here are the benchmark results on a Tesla V100 machine with Julia 1.1.0.

Knet#master

julia> @time benchmark(M,o;N=100)
 29.891977 seconds (11.12 M allocations: 5.232 GiB, 7.43% gc time)

julia> @time benchmark(M,o;N=100)
 24.664394 seconds (11.12 M allocations: 5.232 GiB, 6.71% gc time)

julia> @time benchmark(M,o;N=1000)
244.700195 seconds (111.28 M allocations: 52.324 GiB, 6.23% gc time)

Knet#cuarrays

julia> @time benchmark(M,o;N=100)
 35.008532 seconds (18.46 M allocations: 10.327 GiB, 10.82% gc time)

julia> @time benchmark(M,o;N=100)
 33.461694 seconds (18.45 M allocations: 10.326 GiB, 11.60% gc time)

julia> @time benchmark(M,o;N=1000)
329.063674 seconds (184.61 M allocations: 103.264 GiB, 10.29% gc time)

Note: as I said earlier, these results are not realistic, because we use unequal sequence sizes in the real data, which is more challenging for memory management.

@ekinakyurek (Collaborator) commented Feb 6, 2019

I was just training the same network and new timing on current master of Knet:
100.00%┣███████████████████████████┫ 10938/10938 [32:17/32:17, 5.65i/s]

Is this for KnetArrays or CuArrays?

It was KnetArrays, but there could be a mistake because I was transitioning the code; I believe that's why the result above is better than the previous ones. I will update you.

@ekinakyurek (Collaborator)


@denizyuret the first benchmarks are still the same, i.e. 1 epoch takes ~39 minutes. Sorry for the confusion.

@maleadt (Collaborator) commented Feb 7, 2019

How much memory does this benchmark require? It OOMs on both my 4GB GTX970 and 6GB Titan...

@denizyuret (Owner) commented Feb 7, 2019 via email

@maleadt (Collaborator) commented Feb 7, 2019

Got it working on a different system.

CuArrays.jl (cuarrays branch)

julia> @time benchmark(M,o;N=100)
 35.288962 seconds (19.78 M allocations: 10.385 GiB, 4.16% gc time)

julia> @time benchmark(M,o;N=100)
 34.870759 seconds (19.78 M allocations: 10.385 GiB, 4.22% gc time)

julia> @time benchmark(M,o;N=100)
 34.796848 seconds (19.77 M allocations: 10.385 GiB, 4.25% gc time)

Regular Knet (v0.8.2-803-g0a2320f8, closest master branch commit)

julia> @time benchmark(M,o;N=100)
 66.928443 seconds (11.12 M allocations: 5.232 GiB, 50.40% gc time)

julia> @time benchmark(M,o;N=100)
 68.562458 seconds (11.12 M allocations: 5.232 GiB, 51.33% gc time)

So yeah... This is on a GTX 2080 with 11GB RAM, Julia 1.1.

@denizyuret (Owner) commented Feb 7, 2019 via email

@ekinakyurek (Collaborator) commented Feb 7, 2019

Yes, I shared them above (see the Tesla V100 results in my earlier comment). @maleadt, Knet master is at v1.2.0+; how did you use v0.8.2?

@maleadt (Collaborator) commented Feb 7, 2019

Hmm, now it takes over 100 seconds for both implementations; something seems off.
I'll do some proper investigation and report those timings tomorrow or next week.

@maleadt (Collaborator) commented Feb 7, 2019

FWIW, I could reproduce the timings once more on proper branches (cuarrays and 0a2320f8381699ac9e2ceb012037afb22748296f as the merge-base of cuarrays and master):

CuArray

julia> @time benchmark(M,o;N=100)
 33.988073 seconds (19.77 M allocations: 10.385 GiB, 4.24% gc time)

julia> @time benchmark(M,o;N=100)
 35.941523 seconds (19.77 M allocations: 10.385 GiB, 4.10% gc time)

Knet

julia> @time benchmark(M,o;N=100)
 64.806681 seconds (11.12 M allocations: 5.232 GiB, 49.43% gc time)

julia> @time benchmark(M,o;N=100)
 62.485524 seconds (11.12 M allocations: 5.232 GiB, 48.08% gc time)

I've never seen Knet perform better, but CuArrays took over 100s at one point. Not sure where that came from, does the model have any nondeterminism?

Also, with CuArrays we can inspect timings more in detail, and GC time seems reasonable:

julia> CuArrays.@time benchmark(M,o;N=100)
 35.629547 seconds (19.78 M CPU allocations: 10.385 GiB, 4.35% gc time) (172.40 k GPU allocations: 734.200 GiB, 0.45% gc time of which 100.00% spent allocating)

@maleadt (Collaborator) commented Feb 7, 2019

I've pushed some fixes for the latest CuArrays/CUDAdrv/CUDAnative. Use CUDAdrv/CUDAnative from master and CuArrays from JuliaGPU/CuArrays.jl#275.
Again, I'll try to do some more robust measurements later.

@ekinakyurek (Collaborator)

Let me note that an 8GB GPU is a good environment for stress testing, so 12GB GPUs would be better for testing normal conditions (assuming ML development environments mostly have 12GB).

I also ran tests on an Nvidia K80, and the results are different again:

knet#master

julia> @time benchmark(M,o;N=100)
 65.345801 seconds (11.11 M allocations: 5.232 GiB, 0.58% gc time)

Knet#cuarrays

julia> @time benchmark(M,o;N=100)
 79.009186 seconds (18.44 M allocations: 10.326 GiB, 1.38% gc time)

Do you test with the latest updates to CuArrays? If so, how can I make my Knet#cuarrays branch use the latest CuArrays?

@ekinakyurek (Collaborator)

Okay, I am checking out the latest branches of CuArrays and will try again.

@ekinakyurek (Collaborator)

I ran the script provided by @maleadt on an AWS p2.xlarge (a K80 GPU) instance with the AMI image ami-01df61498d474ecd2.

Here is the script
Here is the log

Here are the results:

Knet#cuarrays

 80.332260 seconds (19.77 M allocations: 10.385 GiB, 1.82% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 80.397125 seconds (19.77 M allocations: 10.385 GiB, 1.83% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 80.449783 seconds (19.77 M allocations: 10.385 GiB, 1.84% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

Knet#master

68.879578 seconds (11.11 M allocations: 5.232 GiB, 0.96% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.964452 seconds (11.12 M allocations: 5.232 GiB, 0.75% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 69.143434 seconds (11.12 M allocations: 5.232 GiB, 0.76% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
  

@maleadt (Collaborator) commented Feb 7, 2019

Great, thanks. That matches my most recent measurements, and leaves me stumped as to how I ended up with those (consistently reproducible within that session) measurements in #417 (comment).
Anyhow, now that I've got the benchmark working, I'll do some digging.

@MikeInnes (Collaborator, Author)

Self-reminder that this needs to be updated for JuliaGPU/CuArrays.jl#275.

@maleadt (Collaborator) commented Feb 8, 2019

Self-reminder that this needs to be updated for JuliaGPU/CuArrays.jl#275.

I already pushed fixes onto this branch for JuliaGPU/CUDAdrv.jl#125 and JuliaGPU/CuArrays.jl#275.

I've been doing some optimizations of the CuArrays allocator in JuliaGPU/CuArrays.jl#277, resulting in the following timings on my 2080Ti.

# plain Knet
 66.937098 seconds (11.27 M allocations: 5.240 GiB, 49.99% gc time)
 67.093311 seconds (11.27 M allocations: 5.240 GiB, 49.75% gc time)
 65.786219 seconds (11.27 M allocations: 5.240 GiB, 48.31% gc time)

# Knet + CuArrays
 49.807687 seconds (18.95 M allocations: 10.387 GiB, 32.29% gc time)
 48.358189 seconds (18.95 M allocations: 10.387 GiB, 30.80% gc time)
 47.869683 seconds (18.95 M allocations: 10.387 GiB, 30.16% gc time)
 50.628463 seconds (18.95 M CPU allocations: 10.387 GiB, 32.64% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.15% gc time)

What's curious here is the increase in the number of CPU allocations -- probably because of nesting a non-isbits (i.e. needing allocation) CuArray within a KnetArray?

In one of the runs, Knet+CuArrays triggered the fast execution from #417 (comment) again, so I'm also suspecting some nondeterminism somewhere.

 35.302870 seconds (18.85 M allocations: 10.386 GiB, 5.95% gc time)
 36.575237 seconds (18.85 M allocations: 10.386 GiB, 5.94% gc time)
 35.099538 seconds (18.85 M allocations: 10.386 GiB, 5.57% gc time)
 34.825517 seconds (18.85 M CPU allocations: 10.386 GiB, 5.33% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.52% gc time)

On cyclops meanwhile, Knet+CuArrays still performs worse:

# Knet
 37.998717 seconds (11.24 M allocations: 5.239 GiB, 1.88% gc time)
 38.056858 seconds (11.25 M allocations: 5.239 GiB, 1.61% gc time)
 38.020132 seconds (11.25 M allocations: 5.239 GiB, 1.53% gc time)

# Knet + CuArrays
 52.888176 seconds (18.84 M allocations: 10.386 GiB, 5.22% gc time)
 52.927398 seconds (18.84 M allocations: 10.386 GiB, 5.21% gc time)
 52.840697 seconds (18.84 M allocations: 10.386 GiB, 5.16% gc time)
 53.060552 seconds (18.84 M CPU allocations: 10.386 GiB, 5.20% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.55% gc time)

@ekinakyurek, could you have a run with CuArrays#tb/opt_pool?

@maleadt (Collaborator) commented Feb 8, 2019

The benchmark above now consistently performs faster on my 2080Ti with Knet+CuArrays:

# Knet + CuArrays
 48.678508 seconds (18.95 M allocations: 10.388 GiB, 31.46% gc time)
 49.486688 seconds (18.95 M CPU allocations: 10.387 GiB, 31.68% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.55% gc time)
 48.685517 seconds (18.95 M allocations: 10.388 GiB, 31.23% gc time)
 48.427788 seconds (18.95 M CPU allocations: 10.387 GiB, 31.17% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.14% gc time)
 48.711011 seconds (18.95 M allocations: 10.388 GiB, 30.84% gc time)
 46.839294 seconds (18.95 M CPU allocations: 10.387 GiB, 29.93% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.23% gc time)
 49.719577 seconds (18.95 M allocations: 10.388 GiB, 31.53% gc time)
 48.216161 seconds (18.95 M CPU allocations: 10.387 GiB, 31.15% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.15% gc time)
 48.006769 seconds (18.95 M allocations: 10.388 GiB, 29.61% gc time)
 48.881215 seconds (18.95 M CPU allocations: 10.387 GiB, 31.61% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.22% gc time)
 48.478350 seconds (18.95 M allocations: 10.388 GiB, 31.35% gc time)
 49.210618 seconds (18.95 M CPU allocations: 10.387 GiB, 31.54% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.21% gc time)
 50.478621 seconds (18.95 M allocations: 10.388 GiB, 32.14% gc time)
 47.736331 seconds (18.95 M CPU allocations: 10.387 GiB, 30.36% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.37% gc time)
 48.819236 seconds (18.95 M allocations: 10.388 GiB, 31.20% gc time)
 48.687014 seconds (18.95 M CPU allocations: 10.387 GiB, 31.37% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.23% gc time)
 48.725251 seconds (18.95 M allocations: 10.388 GiB, 31.46% gc time)
 48.567826 seconds (18.95 M CPU allocations: 10.387 GiB, 31.13% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.16% gc time)
 48.865095 seconds (18.95 M allocations: 10.388 GiB, 31.42% gc time)
 48.436326 seconds (18.95 M CPU allocations: 10.387 GiB, 31.14% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.23% gc time)
 47.915530 seconds (18.95 M allocations: 10.388 GiB, 30.88% gc time)
 48.656057 seconds (18.95 M CPU allocations: 10.387 GiB, 31.47% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.18% gc time)
 47.516082 seconds (18.95 M allocations: 10.388 GiB, 30.84% gc time)
 48.070126 seconds (18.95 M CPU allocations: 10.387 GiB, 31.23% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.15% gc time)
 50.336792 seconds (18.95 M allocations: 10.388 GiB, 29.68% gc time)
 49.260103 seconds (18.95 M CPU allocations: 10.387 GiB, 31.33% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.15% gc time)
 50.539572 seconds (18.95 M allocations: 10.388 GiB, 30.46% gc time)
 49.561482 seconds (18.95 M CPU allocations: 10.387 GiB, 31.52% gc time) (172.40 k GPU allocations: 734.200 GiB, 4.04% gc time)
 48.672160 seconds (18.95 M allocations: 10.388 GiB, 31.44% gc time)

# Knet (was pretty consistent already, so fewer measurements)
 63.134642 seconds (11.27 M allocations: 5.240 GiB, 47.19% gc time)
 65.263926 seconds (11.26 M allocations: 5.239 GiB, 48.70% gc time)
 63.749779 seconds (11.26 M allocations: 5.239 GiB, 47.25% gc time)
 65.495878 seconds (11.26 M allocations: 5.239 GiB, 48.30% gc time)
 64.180856 seconds (11.26 M allocations: 5.239 GiB, 47.76% gc time)
 65.591660 seconds (11.26 M allocations: 5.239 GiB, 48.90% gc time)
 66.132638 seconds (11.26 M allocations: 5.239 GiB, 49.10% gc time)

@ekinakyurek (Collaborator)

Hi @maleadt, @MikeInnes, @denizyuret

I benchmarked again and the results are still similar to the older experiments that I ran on the K80 and V100 GPU machines. I think you somehow overfit the allocation strategy to the 2080Ti, or something else is going on.

AWS-P2Xlarge-K80 Benchmark Info

Here is the updated script
Here is the updated log

knet#cuarrays

 78.341572 seconds (18.58 M allocations: 10.364 GiB, 1.42% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 78.549632 seconds (18.58 M allocations: 10.364 GiB, 1.41% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 78.663026 seconds (18.58 M allocations: 10.364 GiB, 1.42% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

knet#master

arrtype = KnetArray{Float32,N} where N
 68.307643 seconds (11.11 M allocations: 5.232 GiB, 0.75% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.499359 seconds (11.12 M allocations: 5.232 GiB, 0.61% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing
 68.678919 seconds (11.12 M allocations: 5.232 GiB, 0.61% gc time)
#= none:1 =# @time(benchmark(M, o; N=100)) = nothing

Note:

I believe it would be helpful for everybody if you could explain a bit what you updated in CuArrays after these benchmarks, how it differs from Knet's allocation strategy, and how it differs from the previous version of CuArrays.

@maleadt (Collaborator) commented Feb 16, 2019

I think you somehow overfit the allocation strategy to 2080Ti or something else.

There's not much to fit, though: the allocator hasn't got any parameters (except for the idle-time memory freeing, but that isn't relevant here). The only thing I can imagine is whether we first try CUDA.alloc or GC.gc(false); the performance of the two might differ with the CUDA version, the application type (number of allocated objects), or CPU performance.

I believe if you could explain a bit what you updated in CuArrays after benchmarks and how it differs than Knet's allocation strategy or how is different than previous version of CuArrays that would be more helpful for everybody.

The allocator has been refactored and its behavior is now pretty clear (IMO) from the main state machine's code: https://github.com/JuliaGPU/CuArrays.jl/blob/f5cbcfcf329b7e15a9bbbf0f344c5613a83e7148/src/memory.jl#L240-L330

Furthermore, I added timer outputs which yield an output like this when calling CuArrays.pool_timings():

 ───────────────────────────────────────────────────────────────────────────────────
                                            Time                   Allocations      
                                    ──────────────────────   ───────────────────────
          Tot / % measured:              5.25s / 54.1%            389MiB / 2.28%    

 Section                    ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────────
 pooled alloc                1.25k    2.84s   100%  2.26ms   8.86MiB  100%   7.24KiB
   step 4: reclaim unused       63    1.28s  45.3%  20.4ms   28.2KiB  0.31%        -
     reclaim                    63    299ms  10.5%  4.75ms         -  0.00%        -
     scan                       63   84.3μs  0.00%  1.34μs   20.7KiB  0.23%        -
   step 5: gc(true)              3    513ms  18.1%   171ms    310KiB  3.42%   103KiB
   step 3: gc(false)            93    485ms  17.1%  5.22ms    162KiB  1.78%  1.74KiB
   step 2: try alloc           656    474ms  16.7%   723μs   38.9KiB  0.43%        -
   step 1: check pool        1.25k    167μs  0.01%   133ns         -  0.00%        -
 scan                            1   1.60μs  0.00%  1.60μs         -  0.00%        -
 reclaim                         1   1.45μs  0.00%  1.45μs         -  0.00%        -
 ───────────────────────────────────────────────────────────────────────────────────

Adding similar timings to Knet should make it possible to compare behavior, both in terms of how many allocations happen and how much time is spent on them.
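For illustration, the five steps in the timer output above can be read as a try-harder-each-time allocation loop. A simplified CPU-only sketch, using stand-in names (`try_alloc`, `reclaim_unused!`) rather than the real CuArrays internals in src/memory.jl:

```julia
# Sketch of the five-step pooled allocation loop from the timer output.
# `pool` caches freed buffers by size; `try_alloc` stands in for CUDA.alloc.
const pool = Dict{Int,Vector{Vector{UInt8}}}()

try_alloc(bytes) = Vector{UInt8}(undef, bytes)   # may throw OutOfMemoryError
reclaim_unused!() = empty!(pool)                 # return cached buffers to the system

function pooled_alloc(bytes)
    bucket = get!(pool, bytes, Vector{UInt8}[])
    isempty(bucket) || return pop!(bucket)       # step 1: check pool
    for attempt in 1:4
        buf = try try_alloc(bytes) catch; nothing end
        buf === nothing || return buf            # step 2: try alloc
        attempt == 1 && GC.gc(false)             # step 3: quick GC frees dropped buffers
        attempt == 2 && reclaim_unused!()        # step 4: reclaim unused pool memory
        attempt == 3 && GC.gc(true)              # step 5: full GC as a last resort
    end
    throw(OutOfMemoryError())
end

free!(buf) = push!(get!(pool, length(buf), Vector{UInt8}[]), buf)
```

Each step is more expensive than the last, which is why the timer output attributes most of the cost to the reclaim and gc steps rather than to the pool check itself.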

@MikeInnes (Collaborator, Author)

@ekinakyurek, do you have any timings comparing this with the max_pool setting I pointed out earlier? It would be useful to know whether disabling it gives you any speedup on your setup.

@MikeInnes (Collaborator, Author)

@Keno is looking at GPU performance now, so might have some idea of what the low hanging fruit is.

@ekinakyurek (Collaborator)

@MikeInnes sorry for the late reply, I am going to look at it today.

@Keno commented Feb 23, 2019

Still looking into it. Hopefully I can just fix the low hanging fruits and things will magically get faster.

@ekinakyurek (Collaborator) commented Feb 23, 2019

@MikeInnes
CuArrays.MAX_POOL = 10000*1024^2 performed slightly worse,
CuArrays.MAX_POOL = 1000*1024^2 performed the same as my old measurements,
CuArrays.MAX_POOL = 100*1024^2 is the original setting,
CuArrays.MAX_POOL = 10*1024^2 performed badly.

@denizyuret (Owner)

Knet 1.4 has CuArrays fully integrated.

@denizyuret denizyuret closed this Sep 1, 2020
@denizyuret denizyuret deleted the cuarrays branch September 1, 2020 11:19
5 participants