
Training Code - New Features #106

Closed
6 of 9 tasks
melgor opened this issue Mar 9, 2016 · 5 comments
melgor commented Mar 9, 2016

Adding new features to the training code, which will clean up the code, reduce memory consumption during training, and enable multi-GPU training.
Based on talk: https://groups.google.com/forum/?hl=en#!topic/cmu-openface/9Be_VMC659c

melgor commented Mar 11, 2016

I tried to add `cudnn.convert`. Unfortunately, there is a problem with GPU memory consumption, even though I have used it in my other projects and it works fine there.
I tested it and traced the problem to
model:float()
Without it, everything works fine...
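For context, the conversion in question looks roughly like this (a sketch; `createModel` is a hypothetical constructor standing in for the actual model-building code in train.lua):

```lua
require 'cudnn'

-- Build the network and move it to the GPU.
local model = createModel()   -- hypothetical; stands in for OpenFace's model definition
model = model:cuda()

-- Swap nn modules for their cudnn counterparts in place.
cudnn.convert(model, cudnn)

-- The problematic step: converting back to float (e.g. before torch.save)
-- appears to interact badly with GPU memory once cudnn.convert has been used.
model:float()
```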

bamos pushed a commit that referenced this issue Mar 16, 2016
Replace nn_to_cudnn by cudnn.convert #106
fmassa commented Mar 17, 2016

@melgor About the shareGradient part, I've started working on generic code that reuses the intermediate buffers of the network without having to define the sharing by hand: https://github.com/fmassa/optimize-net
There are unit tests for common architectures, and I'll soon finish a version that works for training as well (there is already an open PR for that which works; I just need to update the README and add some more tests).
I'll also soon add a more robust option for clearing the gradWeights in inference mode (the current simple approach loses the sharings if there are any), but that's not difficult to add.
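A minimal sketch of what using optnet instead of hand-written sharing could look like (assuming the library's `optimizeMemory` entry point; `createModel` and the input shape are placeholders):

```lua
local optnet = require 'optnet'

local model = createModel()              -- hypothetical model constructor
local input = torch.rand(1, 3, 96, 96)   -- a sample input of the network's expected shape

-- Analyze the graph with a sample input and share intermediate buffers
-- automatically; training mode also handles gradient buffers.
optnet.optimizeMemory(model, input, {mode = 'training'})
```

Note that at the time of the comment the training mode was still in an open PR, so the exact options may differ.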

melgor commented May 9, 2016

I was trying to add the multi-GPU code but I failed. I ran into a problem similar to the one with adding cuDNN: it is exactly about converting the model to float before saving.
https://github.com/cmusatyalab/openface/blob/master/training/train.lua#L91
Do we really need it? Everybody will surely use CUDA for training. It would be better to provide a script that converts the model to float.
But this introduces another problem: we would not be able to use any "clearState" solution, because it would break the memory optimization made by 'optnet'.

To sum up:

  1. It would be easier to save the CUDA model.
  2. The saved model will be bigger than it is now (as we save gradient matrices etc.).
  3. It will need an additional script for "clearState" and "float".
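The additional script in point 3 could be as small as this sketch (file names are hypothetical):

```lua
-- Post-training conversion: load the saved CUDA model, strip training
-- state, and write a compact float model for distribution.
require 'cutorch'
require 'cunn'

local model = torch.load('model_cuda.t7')   -- hypothetical path
model:clearState()    -- drop gradients and intermediate buffers
model:float()         -- convert tensors to CPU float
torch.save('model_float.t7', model)
```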

What do you think about it, @bamos?

bamos commented May 9, 2016

Hi @melgor - it seems reasonable to save the CUDA model to disk and then use a script that converts it to float and clears the state, if that makes adding these features easier.

I have had some issues with running out of disk space, so a larger model isn't ideal here. I wonder how much disk space compressing the model with something like https://groups.google.com/forum/#!topic/torch7/7hCpmhv20KY would save.
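Independent of any Torch-level support, the saved model file can simply be compressed on disk and decompressed before loading. A quick way to check the savings (the `model.t7` file here is a synthetic stand-in, so the ratio is illustrative only):

```shell
# Create a stand-in for a saved Torch model; in practice this file would
# come from torch.save, and real tensor data compresses far less well
# than the all-zero content used here.
head -c 1048576 /dev/zero > model.t7

# Compress to a sibling file; decompress with gunzip before torch.load.
gzip -9 -c model.t7 > model.t7.gz
ls -l model.t7 model.t7.gz
```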

-Brandon.

stale bot commented Nov 18, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 18, 2017
@stale stale bot closed this as completed Nov 25, 2017