Multi-GPU #6
So, what is currently the issue with multi-threading?
I'm not sure how familiar you are with multi-threading in Python and the horror of the GIL, but I think that's basically the problem, i.e. Theano doesn't release the GIL often and long enough for Python to actually benefit from multiple threads. There might also be problems with Theano's host-device synchronisation points, which force the other threads to wait as well, but I'm not sure about that. Either way, apparently using two threads actually resulted in a slowdown instead of a speedup, so we're going the multi-processing way (which is generally the recommended way in Python). This should be fine; we just need to make sure that it is easy for the worker processes to read and write parameters to a block of shared memory, because other IPC methods would cause an unnecessary slowdown.
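For concreteness, here is a minimal sketch of the idea (assuming numpy and the standard multiprocessing module; this is not the actual Platoon code, and names like n_params, worker and the 0.1/0.01 constants are made up for illustration): worker processes read and write a shared parameter block directly, without going through slower IPC channels.

import ctypes
import numpy as np
from multiprocessing import Process, Lock
from multiprocessing.sharedctypes import RawArray

n_params = 1000  # size of the flattened parameter vector (made up)

# One flat float32 buffer holds the central copy of the parameters.
shared_raw = RawArray(ctypes.c_float, n_params)
lock = Lock()

def worker(shared_raw, lock):
    # Each process wraps the same buffer as a numpy array (no copy involved).
    central_params = np.frombuffer(shared_raw, dtype=np.float32)
    local_params = central_params.copy()    # private copy this worker trains on
    for step in range(100):
        # Stand-in for one SGD step on this worker's GPU.
        local_params -= 0.01 * np.random.randn(n_params).astype(np.float32)
        with lock:
            diff = local_params - central_params
            local_params -= 0.1 * diff      # pull the worker toward the centre
            central_params += 0.1 * diff    # and the centre toward the worker

if __name__ == '__main__':
    # Assumes the default 'fork' start method on Linux, so the shared buffer
    # is inherited by the children.
    processes = [Process(target=worker, args=(shared_raw, lock)) for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()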
Keeping the parameters the same (number_of_samples, batch_size, number_of_epochs, etc.), I ran the MultiGPU code on 1, 2 and 4 GPUs. Here are the timings for each of them: Training time (max_mb) 117000.37121s (single GPU)
Awesome! It's too bad to see the diminishing returns so quickly (small
Definitely, it improved, but the validation error was higher in that case, so I didn't pay much attention to it. I should probably play with that parameter more thoroughly. Right now I am synchronising after every iteration (that is what the results I gave you are based on). Do you think something else could be useful too?
I performed another experiment with 5 GPUs. Training time (max_mb) 117000.37121s (single GPU). I don't exactly know why it performed somewhat better, though.
Do you mean the validation error increased as you increased the number of iterations between parameter synchronisations? That's interesting. If you look at the EASGD paper you'll see that they have experiments with the communication period as high as 64, and it actually decreases their validation error. I guess that is because they are overfitting on CIFAR, and a higher communication period acts as a regulariser, whereas we might still be underfitting? Do you have training curves? In that case we could consider training with high communication periods at first to get to a reasonable error very quickly, and then fine-tune with a lower communication period (or fewer GPUs) later on. Is that behaviour with 5 GPUs reproducible, by the way? 2 to 4 GPUs decreases the training time by 5%, and then 4 to 5 reduces it by 15%?
Yes, the validation error increased as I increased the number of iterations between synchronisations. I don't have any training curves (right now). I will probably run the experiments again with the communication period as a variable and see what happens. Yes, the experiment with 5 GPUs is reproducible (I ran it twice, and the timings were 48765s and 48158s respectively). Yesterday I ran with 6 GPUs. Training time (max_mb) 117000.37121s (single GPU)
I have a possible explanation for the weird behaviour. I was using the kepler machine, and when I was using 4 GPUs, 3 of them were in one box and the other was in another box. When I looked at the communication times between the two boxes, they were 10 times slower than within one box, hence not enough speedup when using 4 GPUs. I asked Fred and he confirmed that this may well be the reason for the behaviour.
Just a clarification: this isn't between compute nodes. This is on a single node that has 2 CPUs. The slow communication was between GPUs attached to different CPUs. Comparing this to the case when the 4 GPUs are on the same CPU is needed to know if that is the cause.
Thank you, Fred, for the update.
See mila-iqia/platoon#14. Can you confirm that your timing was for a fixed number of iterations? The timing done by @abergeron was on the kepler computer and used gpu3 to gpu7, so on the same group of GPUs with efficient communication.
The link above says that when you raise the number of workers, you must at least update the learning rate and the alpha parameter of EASGD to keep the learning efficient.
Yes, for a fixed number of iterations. I kept the learning rate constant, but yes, I changed the alpha parameter of EASGD to 1/number_of_workers. Also, from my experiments, increasing the batch size along with the number of workers did not help at all; in fact it made things worse (for me a batch size of 32 worked well; changing it to 64 or 128 did not help).
Raising the number of workers can, in some sense, be seen as similar to raising the batch size. Can you add as comparison points 2 GPUs with batch size 32 (already done) and 64 (todo), to compare against the 4-GPU training efficiency with batch sizes 32 and 64? Otherwise you may need to change the learning rate or try another alpha parameter.
Okay, I'll do the experiments and update here.
Using 4 GPUs, keeping a batch size of 64 and using alpha = 0.5: 74041.303532s (all 4 GPUs were on the same CPU).
The comment with the timings for 1, 2, 4, 5 and 6 GPUs was with which batch size? 32? What is this timing: the time for a fixed number of mini-batches, or the time to reach a certain error? When you used a batch size of 64, did you see the same total number of examples? I suppose not, given the result you gave. Can you run the timing with batch size 64 but with a fixed total number of examples seen, not a fixed number of batches seen? This means half the number of batches seen. Can you confirm that the timing is wall clock time between the start and end of training? One more question: do you use the "valid" command from the Controller, or do you just train? See mila-iqia/platoon#15
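Just to spell out the arithmetic behind "half the number of batches seen" (illustrative numbers only, not taken from these experiments):

# With a fixed total number of examples, doubling the batch size
# halves the number of mini-batches seen.
n_batches_32 = 10000                  # hypothetical run with batch size 32
total_examples = n_batches_32 * 32
n_batches_64 = total_examples // 64   # == n_batches_32 // 2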
Yes, the rest of my experiments were with batch size = 32. I saw the same number of minibatches while running with a batch size of 64. Yes, the timing is wall clock time between the start and end of training. Yes, I do use the valid command from the Controller.
So with batch size 64 it's actually slower than using 2 GPUs? That's interesting, because in the EASGD paper they use batches of size 128... I'm just looking at the paper now, and instead of specifying α, they seem to specify β = 0.9, where β = α * p (p is the number of workers). This means that they scale α with the number of workers as you did.
However, with 4 workers, β = 0.9 implies α = 0.225, which is quite a bit lower, right? There is also a footnote that reads:
This means that as we increase τ, the communication period, we would have to decrease the learning rate and/or α similarly in order to keep ρ (the amount of exploration done) similar. Is there a chance you could put these results in a spreadsheet somewhere, including the values for α, β, τ, p and η?
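To make the bookkeeping explicit, here is a tiny helper based only on the relation quoted above (β = α * p); nothing in it comes from the actual training code:

def alpha_from_beta(beta, n_workers):
    # Per-worker moving rate alpha when beta = alpha * p is held fixed.
    return beta / n_workers

for p in (1, 2, 4):
    print(p, alpha_from_beta(0.9, p))
# With beta = 0.9, p = 4 gives alpha = 0.225, noticeably lower than the
# alpha = 0.5 used in the runs reported above.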
Yes, I will put these results in a spreadsheet, including the values for α, β, τ, p and η.
Alpha was kept constant at 0.5 in all those experiments, so this does not check the effect of scaling it with the number of workers. With a batch size of 64, since it sees the same number of batches (and therefore more examples), it is normal that it takes longer. I think we should understand why there is no further gain in computation efficiency. Also, as we are checking the efficiency of the computation, you can lower the number of mini-batches.
Right, I was thinking of efficiency of learning, not computation. I forgot that the number of epochs was fixed. Why does it see more examples when the batch size is 64? I thought that one epoch would be defined as each example being seen exactly once by a GPU, in which case the batch size doesn't actually change the total number of examples seen.
It was my mistake: I was training for a maximum number of mini-batches seen, not according to the number of epochs. Nevertheless, I ran only two experiments with batch size 64 and another with batch size 128; the rest were reported with batch_size = 32 only.
Okay, I just noticed you mentioned that above, skimmed too quickly, sorry! So in that case I guess @nouiz is right, and we really need to figure out why the 4 GPU case is so slow. You can print timings with these classes from Blocks by the way: https://github.com/mila-udem/blocks/blob/master/blocks/utils/profile.py

from blocks.utils.profile import Profile, Timer

profile = Profile()
with Timer('training', profile):
    pass  # training code goes here
with Timer('sync', profile):
    pass  # sync code goes here
# etc.
profile.report()
So, just to be on the same page, I am now running with batch_size = 32 and alpha = 0.5 on 4 GPUs.
Summary timings (for an equal number of examples) @bartvm @nouiz
For each multi-GPU configuration, the per-worker table reports: Worker_Num, Time_training, Time_syncing, Time_waiting (in seconds).
1 GPU, batch_size = 32
1 GPU, batch_size = 64
2 GPUs, batch_size = 32, alpha = 0.5 [same number of examples]
Time - 7999.229232(s)
2 GPUs, batch_size = 64, alpha = 0.5 [same number of examples]
4 GPUs, batch_size = 32, alpha = 0.5
4 GPUs, batch_size = 64, alpha = 0.25
4 GPUs, batch_size = 64, alpha = 0.5
4 GPUs, batch_size = 64, alpha = 0.75
Those numbers show a good speedup of 4 GPUs vs 2. What changed?
So to summarize, with batch size 64:
And for batch size 32:
So in this case we see a significant speedup from 2 to 4 GPUs, but from 1 to 2 we see very little speedup, and even a slowdown with batches of size 64? In the original runs we saw a significant speedup:
The repeated tests seem to suggest that the variance isn't actually that high, so I guess it's not just measurement error. So what was different that made the original experiments give a speedup while these new experiments don't see any?
I didn't change anything; I just ran the same old code with a smaller number of mini-batches.
I ran the experiments again with batch_size = 64. 1 GPU - 9718.92s, 9512.32s, 9634.92s
Great, those numbers make more sense! The speedup from 1 to 2 is lower because of synchronisation and locks, but from 2 to 4 it seems almost linear. I guess there was just a bug in the earlier 1 GPU runs?
I skimmed through the logs for the previous single-GPU version; everything there seems to be fine according to the numbers, but now it's consistent!
I'm not convinced that there is no issue any more, but the issue could be outside our code. It could be that we need to select which CPU is used when a given GPU is used.
4 GPUs - 3322.83s, 3423.21s (batch_size = 64). Makes sense to me. Now I will do a hyperparameter search (α, β, τ, p and η) with respect to training and validation error, but since the hyperparameters vary with the number of GPUs, I am planning to go with 4 GPUs. Okay?
ASGD is not doing better than EASGD for NMT, according to the validation error after training on a certain number of minibatches. I tested it with 1, 2 and 4 GPUs and let them train for 2 days. If I ignore the validation error and just compare computation time, both are approximately the same. One reason, as @bartvm pointed out, may be that the dataset I am using (Europarl) is small, and I should just move to a bigger dataset.
This is a summary of the best hyperparameters so far: https://docs.google.com/document/d/1vhc8JZlsm5RDHAX-0u2g7KCto2X5g2M1pvPKtfd5vOQ/edit?usp=sharing
After talking to @abergeron again, it seems that multi-threading is off the table. The current approach will be multi-processing with shared memory. We will need to implement this, so we will eventually need a training script which can take as an argument the name of a shared memory region, to which it reads and writes parameters every N batches using a particular method (I like EASGD).
We will then need to try to get the maximum speedup out of this. We'll need to do some benchmarking to see how this scales to 2, 4, or maybe even 8 GPUs. We also need a way to still measure validation error (I guess we could have a separate GPU that copies and saves the parameters to disk and then performs a validation run; this way we can do early stopping).
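As a starting point, here is a rough sketch of what such a worker script could look like. This is purely illustrative: multiprocessing.shared_memory is used as a stand-in for whatever shared-memory mechanism is chosen, and --shm-name, n_params, train_one_batch and the default values are invented placeholders, not an existing interface.

import argparse
import numpy as np
from multiprocessing import shared_memory

def train_one_batch(params, lr=0.01):
    # Placeholder for one SGD step of the real (Theano) training function.
    params -= lr * np.random.randn(*params.shape).astype(params.dtype)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--shm-name', required=True,
                        help='name of the shared memory region holding the central parameters')
    parser.add_argument('--n-params', type=int, required=True)
    parser.add_argument('--sync-every', type=int, default=10)
    parser.add_argument('--alpha', type=float, default=0.5)
    parser.add_argument('--n-batches', type=int, default=1000)
    args = parser.parse_args()

    # Attach to an already created shared-memory block and view it as the
    # central parameter vector.
    shm = shared_memory.SharedMemory(name=args.shm_name)
    central = np.ndarray((args.n_params,), dtype=np.float32, buffer=shm.buf)
    params = central.copy()  # local copy this worker trains on

    for i in range(args.n_batches):
        train_one_batch(params)
        if (i + 1) % args.sync_every == 0:
            # EASGD-style elastic update against the central copy.
            diff = params - central
            params -= args.alpha * diff
            central += args.alpha * diff  # a real implementation needs a lock here

    shm.close()

if __name__ == '__main__':
    main()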