Multigpu #13

Merged: 16 commits from the multigpu branch merged into master on Jun 12, 2017

Conversation

@fvisin (Owner) commented Jun 1, 2017

Completes #12.

This PR adds multi-GPU support for training. When N devices are specified, the code loads N batches of size batchsize and feeds one batch to each device. It also handles the case where the dataset returns fewer than N*batchsize samples (e.g., at the end of a video, when the number of sequences is not a multiple of N*batchsize).
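As an illustration only (not the PR's actual code), the per-device splitting described above could look like the sketch below, where `split_minibatch` is a hypothetical helper and the trailing chunks may be shorter, or empty, when the dataset returns fewer than N*batchsize samples:

```python
def split_minibatch(samples, n_devices, batch_size):
    """Divide one fetched super-batch into one chunk per device.

    Trailing chunks may be shorter (or empty) when fewer than
    n_devices * batch_size samples are available, e.g. at the end of a video.
    """
    return [samples[i * batch_size:(i + 1) * batch_size]
            for i in range(n_devices)]

# 10 samples, 3 devices, batch size 4 --> chunks of 4, 4 and 2 samples
print([len(c) for c in split_minibatch(list(range(10)), 3, 4)])
```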

fral92 and others added 4 commits May 25, 2017 11:14
* tf.split removed from graph, the placeholders are now in a list
* split the operations of the graph according to the number of working towers
* modify compute_chunk_size function in utils to compute the chunks directly
* Improve the structure by reusing code and using dictionaries
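The "placeholders are now in a list" change can be pictured with the rough sketch below (hypothetical names, TF 1.x API as used at the time, not the repository's code): each device gets its own placeholder, and the feed_dict pairs each placeholder with its chunk instead of feeding one large tensor and splitting it with tf.split inside the graph:

```python
import numpy as np
import tensorflow as tf  # TF 1.x API

n_devices, batch_size = 2, 4
placeholders = [tf.placeholder(tf.float32, [None, 10], name='inputs_dev%d' % i)
                for i in range(n_devices)]

# Fewer samples than n_devices * batch_size: the second chunk comes out short.
samples = np.random.rand(6, 10).astype(np.float32)
chunks = [samples[i * batch_size:(i + 1) * batch_size] for i in range(n_devices)]
feed_dict = {ph: chunk for ph, chunk in zip(placeholders, chunks) if len(chunk)}
```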
@fvisin fvisin requested a review from marcociccone June 1, 2017 10:13
* Select the parts of the graph we need, depending on the number of GPUs
  being used at each iteration (at runtime!)
* Select the summaries we care about, depending on the number of GPUs
  being used at each iteration (at runtime!)
* Create one graph per GPU and dataset split
* Collect all the summaries inside build_graph, as should be
Request that all the sequences in a minibatch belong to the same subset
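A minimal sketch of the "one graph per GPU" construction these commits describe, with a toy linear model standing in for the real network (everything below is illustrative, not the repository's build_graph): each tower is built under its own device and name scope, reuses the shared variables, and registers its own summaries, so the training loop can later fetch only the towers and summaries that are active at a given iteration:

```python
import tensorflow as tf  # TF 1.x API

def build_graph(n_devices):
    """Build one tower per GPU; a toy linear model stands in for the real net."""
    towers = []
    for dev_idx in range(n_devices):
        with tf.device('/gpu:%d' % dev_idx), tf.name_scope('tower_%d' % dev_idx):
            with tf.variable_scope('model', reuse=(dev_idx > 0)):
                x = tf.placeholder(tf.float32, [None, 10], name='inputs')
                w = tf.get_variable('w', [10, 1])
                loss = tf.reduce_mean(tf.matmul(x, w) ** 2)
            summary = tf.summary.scalar('loss', loss)
            towers.append({'inputs': x, 'loss': loss, 'summary': summary})
    return towers

# At runtime, if only k chunks of the minibatch are filled, the loop can fetch
# losses and summaries from towers[:k] alone.
```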
@fvisin fvisin force-pushed the multigpu branch 2 times, most recently from 5d461f9 to 23c2396 on June 1, 2017 15:24
* Improve names
* Allow empty name in write_IoUs_summaries
* Use the number of epochs for the x-axis
@fvisin fvisin force-pushed the multigpu branch 2 times, most recently from 637f051 to 66db7c4 on June 2, 2017 12:37
Force each of the grad update ops to depend only on the variables that
are actually related to the devices linked to that op
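One way to read that last commit, sketched under the same toy setup as above (again hypothetical, not the actual implementation): build one apply_gradients op per possible number of active towers, each averaging only the gradients of the towers it covers, so running the op for k active devices never pulls in operations tied to the inactive ones:

```python
import tensorflow as tf  # TF 1.x API

towers = build_graph(n_devices=2)  # toy towers from the sketch above
opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = [opt.compute_gradients(t['loss']) for t in towers]

update_ops = {}
for k in range(1, len(towers) + 1):
    averaged = []
    # Group the first k towers' (grad, var) pairs by variable; the variables
    # are shared, so every tower lists them in the same order.
    for grads_and_vars in zip(*tower_grads[:k]):
        grads = [g for g, _ in grads_and_vars if g is not None]
        averaged.append((tf.add_n(grads) / float(len(grads)), grads_and_vars[0][1]))
    update_ops[k] = opt.apply_gradients(averaged)
```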
@marcociccone (Collaborator) left a comment

LGTM, I tested it several times and it seems to be working correctly! The only thing I noticed is that when you reload the parameters and continue training, the plot of the IoU metrics becomes a mess... I don't know if it's related to this PR, so we can merge and look at it later on.

@fvisin (Owner, Author) commented Jun 12, 2017

Great, thanks! I'll merge this one and then open an issue for what you reported!

@fvisin fvisin merged commit d2c7310 into master Jun 12, 2017
@fvisin fvisin deleted the multigpu branch June 12, 2017 10:24