Buffer data to GPU in a parallel CPU thread #62
The following is my simple template and benchmark comparing prefetching data in a parallel CPU thread against the normal version without prefetching. With prefetching, the training loop does not need to wait for data. However, it is only about twice as fast, not 10 times. So I guess the GPU could run 10 parallel trainings once all the data is on the GPU; probably I would need 10 prefetching threads. Could you give some comments? Many thanks.

Prefetching version:

```julia
macro swap(x, y)
    quote
        local tmp = $(esc(x))
        $(esc(x)) = $(esc(y))
        $(esc(y)) = tmp
    end
end

# some slow data-loading function, executed on worker 2
@everywhere function get_data(i)
    sleep(0.6)
    println("get_data $i")
    i
end

function slow_train(x)
    sleep(0.6)
    println("slow_train $x")
end

# Double-buffered prefetch: while batch i is being trained on,
# batch i+1 is already being fetched on worker 2.
function prefetch(rng)
    @assert length(rng) > 1
    rng = collect(rng)
    a = b = nothing
    function _iter()
        for i ∈ 1:length(rng)
            if a === nothing
                # first iteration: request the current and the next batch
                a = remotecall(get_data, 2, rng[i])
                b = remotecall(get_data, 2, rng[i+1])
            else
                if i < length(rng)
                    a = remotecall(get_data, 2, rng[i+1])
                end
                @swap(a, b)
            end
            d = fetch(a)
            produce(d)
        end
    end
    return Task(_iter)
end

@time for x ∈ prefetch(1:10)
    slow_train(x)
end
```

Normal version without prefetching, for comparison:

```julia
macro swap(x, y)
    quote
        local tmp = $(esc(x))
        $(esc(x)) = $(esc(y))
        $(esc(y)) = tmp
    end
end

# some slow data-loading function
@everywhere function get_data(i)
    sleep(0.6)
    println("get_data $i")
    i
end

function slow_train(x)
    sleep(0.6)
    println("slow_train $x")
end

# renamed from `fetch` to avoid shadowing Base.fetch
function get_batches(rng)
    rng = collect(rng)
    function _iter()
        for i ∈ 1:length(rng)
            d = get_data(rng[i])
            produce(d)
        end
    end
    return Task(_iter)
end

@time for x ∈ get_batches(1:10)
    slow_train(x)
end
```

Run with `julia test-task.jl`.
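An editorial note: the `produce`/`Task` iteration used above was deprecated in Julia 0.6 and removed in 1.0. The same double-buffering idea can be sketched with a `Channel` on current Julia (1.3+). This is a minimal single-process sketch, not the issue author's code; the `sleep` calls stand in for real data loading and training:

```julia
# Prefetch via a buffered Channel: the producer task keeps up to `buffer`
# batches ready while the consumer trains. Single-process sketch; with
# Distributed one would wrap the producer body in remotecall/fetch as above.
function get_data(i)
    sleep(0.05)              # stands in for slow data loading
    return i
end

slow_train(x) = sleep(0.05)  # stands in for one training step

# Channel(f, size) starts `f` as a task; the channel closes when `f` returns,
# which ends iteration on the consumer side.
prefetch(rng; buffer = 2) = Channel(buffer) do ch
    for i in rng
        put!(ch, get_data(i))   # runs concurrently with the consumer
    end
end

for x in prefetch(1:10)
    slow_train(x)
end
```

Because `sleep` yields, the producer task fills the buffer while the consumer is busy, overlapping the two 0.05 s costs just like the `remotecall` version.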
I am getting 0.1 ms for 1000 batches of 64x1000 Float32. Did I misunderstand the problem? Here is my code:

To clarify: I meant 0.1 ms per transfer.
I tried your test and got 0.27 s for the first run.

I think this is consistent with my results, not slower. Ignore the first result; it includes compilation time. You are transferring 1000 arrays in 0.073 seconds, i.e. a per-transfer cost of 0.073 ms, or 73 μs, which is better than my setup. One call to cudaMalloc takes about 10 μs (GPU allocation is slow, which is why I had to write a custom memory manager for Knet). So it seems the remaining 63 μs is the cost of the RAM→GPU transfer.
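As a quick sanity check of the arithmetic above (all numbers taken from the comment, including the stated 10 μs cudaMalloc cost):

```julia
total_s = 0.073                            # time for 1000 transfers, from the benchmark
per_transfer_us = total_s / 1000 * 1e6     # = 73 μs per transfer
cudamalloc_us = 10.0                       # stated cost of one cudaMalloc call
copy_us = per_transfer_us - cudamalloc_us  # ≈ 63 μs left for the RAM→GPU copy
```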
For another data point, here is what I get on an AWS instance:

Thanks for clarifying the CPU-to-GPU transfer timings. It is a common problem and independent of the framework, then. As your benchmarks show, it is fast enough though.
GPU memory allocation (converting an Array to a KnetArray) is really slow: 0.06 s for each batch of 64x1000 Float32. 100 batches take 6 seconds when I convert each batch to a KnetArray inside the training loop, and my GPU utilization stayed at only about 5% throughout training; 100 epochs of training would take 10 minutes.
If I convert the whole training set to KnetArray before training, 100 epochs take just 1 minute and GPU utilization jumps to 90%. But for a large dataset there is not enough GPU memory.
My questions are:
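An editorial sketch of one possible middle ground between converting each batch inside the loop and preconverting the whole dataset: allocate a single device buffer up front and copy each batch into it, so the per-batch cost is only the host-to-device copy, not a fresh GPU allocation. The names here are illustrative (`train_step` is a placeholder), and the runnable stand-in below uses a plain `Matrix{Float32}` in place of a device array; with Knet the one-time allocation would be something like `buf = KnetArray(first(batches))`, assuming `copyto!` from a host array is supported:

```julia
# Reuse one preallocated buffer for every batch: pay the allocation once,
# then only copy. Matrix{Float32} stands in for a device array here.
function train_with_buffer(train_step, batches)
    buf = similar(first(batches))   # one-time allocation (a KnetArray on GPU)
    for x in batches
        copyto!(buf, x)             # copy the batch into the reused buffer
        train_step(buf)             # placeholder for one training step
    end
end
```

This trades the 0.06 s-per-batch allocation cost for a plain copy, while keeping only one batch resident on the device at a time.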