
Buffer data to GPU in a parallel CPU thread #62

Closed
ngphuoc opened this issue Jan 31, 2017 · 7 comments
ngphuoc commented Jan 31, 2017

GPU memory allocation (converting an Array to a KnetArray) is really slow: about 0.06 s per batch of 64x1000 Float32, so 100 batches take 6 seconds when I convert each batch to a KnetArray inside the training loop. My GPU utilization stayed at only about 5% throughout training, and 100 epochs of training took 10 minutes.

If I instead convert the whole training set to KnetArray before training, 100 epochs take just 1 minute and my GPU utilization jumps to 90%. But for a large dataset there is not enough GPU memory.

My questions are:

  1. Could the GPU memory allocation be made faster?
  2. Is it possible to convert Array to KnetArray up to the GPU memory limit in a parallel thread and feed the training loop as it needs data (using remotecall, @spawnat, and fetch) to avoid this bottleneck?
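For reference, on recent Julia (≥ 1.3, so after this thread) the idea in question 2 can also be sketched with a bounded Channel that a background task fills while the training loop consumes. This is a minimal sketch, not Knet API: `load_batch` and the loop body are illustrative stand-ins for the Array→KnetArray transfer and the training step.

```julia
# A background task fills a bounded channel while the consumer trains,
# overlapping data transfer with compute. Julia >= 1.3 syntax.
function prefetch_channel(load_batch, n; depth=2)
    Channel(depth) do ch              # buffer at most `depth` batches
        for i in 1:n
            put!(ch, load_batch(i))   # blocks when the buffer is full
        end
    end                               # channel closes when the task ends
end

load_batch(i) = (sleep(0.01); i)      # stand-in for Array -> GPU transfer
total = 0
for x in prefetch_channel(load_batch, 5)
    global total += x                 # stand-in for training on batch x
end
println(total)                        # prints 15 (= 1+2+3+4+5)
```

With `depth=2` this is the same double-buffering scheme as the remotecall version below, just expressed with a task-bound channel instead of explicit future swapping.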
ngphuoc commented Jan 31, 2017

The following is my simple template and benchmark comparing prefetching data on a parallel CPU worker against the normal version without prefetching. Basically, the training loop does not have to wait for data. However, it is only twice as fast, not 10 times, so I guess the GPU could run 10 trainings in parallel once all the data is on the GPU; I probably need 10 prefetching workers. Could you give some comments? Many thanks.

macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function prefetch(rng)
  @assert length(rng) > 1
  rng = collect(rng)
  a = b = nothing
  function _iter()
    for i in 1:length(rng)
      if a == nothing
        a = remotecall(get_data, 2, rng[i])
        b = remotecall(get_data, 2, rng[i+1])
      else
        if i < length(rng)
          a = remotecall(get_data, 2, rng[i+1])
        end
        @swap(a,b)
      end
      d = fetch(a)
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in prefetch(1:10)
  slow_train(x)
end
% julia -p 2 test-task.jl
6.957115 seconds (153.23 k allocations: 6.454 MB)
macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function fetch(rng)  # note: shadows Base.fetch (harmless here, nothing is fetched remotely)
  rng = collect(rng)
  function _iter()
    for i in 1:length(rng)
      d = get_data(rng[i])
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in fetch(1:10)
  slow_train(x)
end

% julia test-task.jl
12.146958 seconds (84.82 k allocations: 3.528 MB)


denizyuret (Owner) commented:

I am getting 0.1ms for 1000 batches of 64x1000 Float32. Did I misunderstand the problem? Here is my code:

using Knet

function togpu(a)
    b=Array(Any,length(a))
    @inbounds for i=1:length(a)
        b[i]=KnetArray(a[i])
    end
    return b
end

a = [ rand(Float32, 64, 1000) for i=1:1000 ]
@time a1=togpu(a);
@time a2=togpu(a);
@time a3=togpu(a);

denizyuret (Owner) commented:

To clarify: I meant 0.1ms per transfer.

ngphuoc commented Feb 13, 2017

I tried your test and got 0.27 s for the first @time and 0.07 s for the second and third, which is 70 times slower than yours. Is this abnormal? My PC configuration is an i7-5820K CPU with a GTX 1080 GPU.

julia> using Knet
INFO: Knet using GPU 0

julia> function togpu(a)
           b=Array(Any,length(a))
           @inbounds for i=1:length(a)
               b[i]=KnetArray(a[i])
           end
           return b
       end
togpu (generic function with 1 method)

julia> a = [ rand(Float32, 64, 1000) for i=1:1000 ];

julia> @time a1=togpu(a);
  0.276411 seconds (243.87 k allocations: 10.282 MB)

julia> @time a2=togpu(a);
  0.073134 seconds (10.01 k allocations: 289.391 KB)

julia> @time a3=togpu(a);
  0.073607 seconds (10.01 k allocations: 289.391 KB)

denizyuret (Owner) commented:

I think this is consistent with my results, not slower. Ignore the first result; it includes compilation time. You are transferring 1000 arrays in 0.073 seconds, which means the per-transfer cost is 0.073 ms, or 73 μs, which is better than my setup. One call to cudaMalloc takes about 10 μs (GPU allocation is slow, which is why I had to write a custom memory manager for Knet). So roughly 63 μs is the cost of the RAM->GPU transfer.
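The arithmetic above can be checked in a few lines. Note the 10 μs cudaMalloc figure is taken from the comment, not measured here:

```julia
# Check the per-transfer arithmetic from the timings in this thread.
total_s   = 0.073            # measured: 1000 transfers of 64x1000 Float32
n         = 1000
per_us    = total_s / n * 1e6        # microseconds per transfer (~73 μs)
malloc_us = 10.0                     # approximate cudaMalloc cost (from the comment)
copy_us   = per_us - malloc_us       # implied RAM -> GPU copy cost (~63 μs)
println((per_us, copy_us))
```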

denizyuret (Owner) commented:

For another data point, here is what I get on an AWS instance:

[ec2-user@ip-172-31-23-146 ~]$ julia foo.jl
INFO: Knet using GPU 0
  0.488581 seconds (243.72 k allocations: 10.312 MB)
  0.184520 seconds (10.01 k allocations: 289.391 KB)
  0.190965 seconds (10.01 k allocations: 289.391 KB)

ngphuoc commented Feb 16, 2017

Thanks for clarifying the CPU-to-GPU transfer time. It is a common problem and independent of the framework, then. Based on your benchmarks, it is fast enough though.

@ngphuoc ngphuoc closed this as completed Feb 16, 2017