Multigpu #13

Merged: 16 commits from the multigpu branch merged into master on Jun 12, 2017

Conversation

@fvisin (Owner) commented Jun 1, 2017

Completes #12.

This PR adds multi-GPU support for training. When N devices are specified, the code loads N batches of size batchsize and feeds one batch to each device. It also handles the case where the dataset returns fewer than N*batchsize samples (e.g., at the end of a video, when the number of sequences is not a multiple of N*batchsize).
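As an illustration only (not the PR's actual code), the per-device splitting described above could look like the sketch below, where `split_minibatch` is a hypothetical helper and the trailing chunks may be shorter, or empty, when the dataset returns fewer than N*batchsize samples:

```python
def split_minibatch(samples, n_devices, batch_size):
    """Divide one fetched super-batch into one chunk per device.

    Trailing chunks may be shorter (or empty) when fewer than
    n_devices * batch_size samples are available, e.g. at the end of a video.
    """
    return [samples[i * batch_size:(i + 1) * batch_size]
            for i in range(n_devices)]

# 10 samples, 3 devices, batch size 4 --> chunks of 4, 4 and 2 samples
print([len(c) for c in split_minibatch(list(range(10)), 3, 4)])
```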

fral92 and others added 4 commits May 25, 2017 11:14
* tf.split removed from graph, the placeholders are now in a list
* split the operations of the graph according to the number of working towers
* modify compute_chunk_size function in utils to compute the chunks directly
* Improve the structure by reusing code and using dictionaries
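The "placeholders are now in a list" change can be pictured with the rough sketch below (hypothetical names, TF 1.x API as used at the time, not the repository's code): each device gets its own placeholder, and the feed_dict pairs each placeholder with its chunk instead of feeding one large tensor and splitting it with tf.split inside the graph:

```python
import numpy as np
import tensorflow as tf  # TF 1.x API

n_devices, batch_size = 2, 4
placeholders = [tf.placeholder(tf.float32, [None, 10], name='inputs_dev%d' % i)
                for i in range(n_devices)]

# Fewer samples than n_devices * batch_size: the second chunk comes out short.
samples = np.random.rand(6, 10).astype(np.float32)
chunks = [samples[i * batch_size:(i + 1) * batch_size] for i in range(n_devices)]
feed_dict = {ph: chunk for ph, chunk in zip(placeholders, chunks) if len(chunk)}
```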
@fvisin fvisin requested a review from marcociccone June 1, 2017 10:13
* Select the parts of the graph we need, depending on the number of GPUs
  being used at each iteration (at runtime!)
* Select the summaries we care about, depending on the number of GPUs
  being used at each iteration (at runtime!)
* Create one graph per GPU and dataset split
* Collect all the summaries inside build_graph, as should be
Request that all the sequences in a minibatch belong to the same subset
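A minimal sketch of the "one graph per GPU" construction these commits describe, with a toy linear model standing in for the real network (everything below is illustrative, not the repository's build_graph): each tower is built under its own device and name scope, reuses the shared variables, and registers its own summaries, so the training loop can later fetch only the towers and summaries that are active at a given iteration:

```python
import tensorflow as tf  # TF 1.x API

def build_graph(n_devices):
    """Build one tower per GPU; a toy linear model stands in for the real net."""
    towers = []
    for dev_idx in range(n_devices):
        with tf.device('/gpu:%d' % dev_idx), tf.name_scope('tower_%d' % dev_idx):
            with tf.variable_scope('model', reuse=(dev_idx > 0)):
                x = tf.placeholder(tf.float32, [None, 10], name='inputs')
                w = tf.get_variable('w', [10, 1])
                loss = tf.reduce_mean(tf.matmul(x, w) ** 2)
            summary = tf.summary.scalar('loss', loss)
            towers.append({'inputs': x, 'loss': loss, 'summary': summary})
    return towers

# At runtime, if only k chunks of the minibatch are filled, the loop can fetch
# losses and summaries from towers[:k] alone.
```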
@fvisin fvisin force-pushed the multigpu branch 2 times, most recently from 5d461f9 to 23c2396 on June 1, 2017 15:24
* Improve names
* Allow empty name in write_IoUs_summaries
* Use the number of epochs for the x-axis
@fvisin fvisin force-pushed the multigpu branch 2 times, most recently from 637f051 to 66db7c4 on June 2, 2017 12:37
Force each of the grad update ops to depend only on the variables that
are actually related to the devices linked to that op
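One way to read that last commit, sketched under the same toy setup as above (again hypothetical, not the actual implementation): build one apply_gradients op per possible number of active towers, each averaging only the gradients of the towers it covers, so running the op for k active devices never pulls in operations tied to the inactive ones:

```python
import tensorflow as tf  # TF 1.x API

towers = build_graph(n_devices=2)  # toy towers from the sketch above
opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = [opt.compute_gradients(t['loss']) for t in towers]

update_ops = {}
for k in range(1, len(towers) + 1):
    averaged = []
    # Group the first k towers' (grad, var) pairs by variable; the variables
    # are shared, so every tower lists them in the same order.
    for grads_and_vars in zip(*tower_grads[:k]):
        grads = [g for g, _ in grads_and_vars if g is not None]
        averaged.append((tf.add_n(grads) / float(len(grads)), grads_and_vars[0][1]))
    update_ops[k] = opt.apply_gradients(averaged)
```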
@marcociccone (Collaborator) left a comment

LGTM, I tested it several times and it seems to be working correctly! The only thing I noticed is that when you reload the parameters and continue training, the plot of the IoU metrics becomes a mess... I don't know if it's related to this PR, so we can merge and look at it later on.

@fvisin (Owner, Author) commented Jun 12, 2017

Great, thanks! I'll merge this one and then open an issue for what you reported!

@fvisin fvisin merged commit d2c7310 into master Jun 12, 2017
@fvisin fvisin deleted the multigpu branch June 12, 2017 10:24