
Buffer data to GPU in a parallel CPU thread #62

Closed
ngphuoc opened this issue Jan 31, 2017 · 7 comments
ngphuoc commented Jan 31, 2017

GPU memory allocation (converting an Array to a KnetArray) is really slow: about 0.06 s per batch of 64x1000 Float32, so 100 batches take 6 seconds when I convert each batch to a KnetArray inside the training loop. My GPU utilization stayed at only about 5% throughout training, and 100 epochs of training took 10 minutes.

If I instead convert the whole training set to KnetArray before training, 100 epochs take just 1 minute and my GPU utilization jumps to 90%. But for a large dataset there is not enough GPU memory.

My questions are:

  1. Could the GPU memory allocation be made faster?
  2. Is it possible to convert Array to KnetArray up to the GPU memory limit in a parallel thread and feed the training loop as it needs data (using remotecall, @spawnat, and fetch) to avoid this bottleneck?
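For reference, on recent Julia (≥ 1.3, so after this thread) the idea in question 2 can also be sketched with a bounded Channel that a background task fills while the training loop consumes. This is a minimal sketch, not Knet API: `load_batch` and the loop body are illustrative stand-ins for the Array→KnetArray transfer and the training step.

```julia
# A background task fills a bounded channel while the consumer trains,
# overlapping data transfer with compute. Julia >= 1.3 syntax.
function prefetch_channel(load_batch, n; depth=2)
    Channel(depth) do ch              # buffer at most `depth` batches
        for i in 1:n
            put!(ch, load_batch(i))   # blocks when the buffer is full
        end
    end                               # channel closes when the task ends
end

load_batch(i) = (sleep(0.01); i)      # stand-in for Array -> GPU transfer
total = 0
for x in prefetch_channel(load_batch, 5)
    global total += x                 # stand-in for training on batch x
end
println(total)                        # prints 15 (= 1+2+3+4+5)
```

With `depth=2` this is the same double-buffering scheme as the remotecall version below, just expressed with a task-bound channel instead of explicit future swapping.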
ngphuoc commented Jan 31, 2017

The following is my simple template and benchmark comparing prefetching data on a parallel CPU worker against the normal version without prefetching. Basically, the training loop does not have to wait for data. However, it is only twice as fast, not 10 times, so I guess the GPU could run 10 trainings in parallel once all the data is on the GPU; I probably need 10 prefetching workers. Could you give some comments? Many thanks.

macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function prefetch(rng)
  @assert length(rng) > 1
  rng = collect(rng)
  a = b = nothing
  function _iter()
    for i in 1:length(rng)
      if a == nothing
        a = remotecall(get_data, 2, rng[i])
        b = remotecall(get_data, 2, rng[i+1])
      else
        if i < length(rng)
          a = remotecall(get_data, 2, rng[i+1])
        end
        @swap(a,b)
      end
      d = fetch(a)
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in prefetch(1:10)
  slow_train(x)
end
% julia -p 2 test-task.jl
6.957115 seconds (153.23 k allocations: 6.454 MB)
macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function fetch(rng)  # note: shadows Base.fetch (harmless here, nothing is fetched remotely)
  rng = collect(rng)
  function _iter()
    for i in 1:length(rng)
      d = get_data(rng[i])
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in fetch(1:10)
  slow_train(x)
end

% julia test-task.jl
12.146958 seconds (84.82 k allocations: 3.528 MB)


denizyuret (Owner) commented:

I am getting 0.1ms for 1000 batches of 64x1000 Float32. Did I misunderstand the problem? Here is my code:

using Knet

function togpu(a)
    b=Array(Any,length(a))
    @inbounds for i=1:length(a)
        b[i]=KnetArray(a[i])
    end
    return b
end

a = [ rand(Float32, 64, 1000) for i=1:1000 ]
@time a1=togpu(a);
@time a2=togpu(a);
@time a3=togpu(a);

denizyuret (Owner) commented:

To clarify: I meant 0.1ms per transfer.

ngphuoc commented Feb 13, 2017

I tried your test and got 0.27 s for the first @time and 0.07 s for the second and third, which is 70 times slower than yours. Is this abnormal? My PC configuration is an i7-5820K CPU with a GTX 1080 GPU.

julia> using Knet
INFO: Knet using GPU 0

julia> function togpu(a)
           b=Array(Any,length(a))
           @inbounds for i=1:length(a)
               b[i]=KnetArray(a[i])
           end
           return b
       end
togpu (generic function with 1 method)

julia> a = [ rand(Float32, 64, 1000) for i=1:1000 ];

julia> @time a1=togpu(a);
  0.276411 seconds (243.87 k allocations: 10.282 MB)

julia> @time a2=togpu(a);
  0.073134 seconds (10.01 k allocations: 289.391 KB)

julia> @time a3=togpu(a);
  0.073607 seconds (10.01 k allocations: 289.391 KB)

denizyuret (Owner) commented:

I think this is consistent with my results, not slower. Ignore the first result; it includes compilation time. You are transferring 1000 arrays in 0.073 seconds, which means the per-transfer cost is 0.073 ms, or 73 μs, which is better than my setup. One call to cudaMalloc takes about 10 μs (GPU allocation is slow, which is why I had to write a custom memory manager for Knet). So roughly 63 μs is the cost of the RAM->GPU transfer.
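The arithmetic above can be checked in a few lines. Note the 10 μs cudaMalloc figure is taken from the comment, not measured here:

```julia
# Check the per-transfer arithmetic from the timings in this thread.
total_s   = 0.073            # measured: 1000 transfers of 64x1000 Float32
n         = 1000
per_us    = total_s / n * 1e6        # microseconds per transfer (~73 μs)
malloc_us = 10.0                     # approximate cudaMalloc cost (from the comment)
copy_us   = per_us - malloc_us       # implied RAM -> GPU copy cost (~63 μs)
println((per_us, copy_us))
```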

denizyuret (Owner) commented:

For another data point, here is what I get on an AWS instance:

[ec2-user@ip-172-31-23-146 ~]$ julia foo.jl
INFO: Knet using GPU 0
  0.488581 seconds (243.72 k allocations: 10.312 MB)
  0.184520 seconds (10.01 k allocations: 289.391 KB)
  0.190965 seconds (10.01 k allocations: 289.391 KB)

ngphuoc commented Feb 16, 2017

Thanks for clarifying the CPU-to-GPU transfer time. It is a common problem and independent of the framework, then. Based on your benchmarks, it is fast enough though.

@ngphuoc ngphuoc closed this as completed Feb 16, 2017