slow training progress #17

Closed · ahundt opened this issue Apr 20, 2017 · 3 comments
ahundt (Collaborator) commented Apr 20, 2017

I'm running the current master and I'm not seeing the performance described in #4; perhaps something is off with my converted dataset?


11127/11127 [==============================] - 7896s - loss: 1.3891 - sparse_accuracy_ignoring_last_label: 0.6165
lr: 0.009964
Epoch 2/250
11127/11127 [==============================] - 7972s - loss: 1.0751 - sparse_accuracy_ignoring_last_label: 0.6326
lr: 0.009928
Epoch 3/250
11127/11127 [==============================] - 7937s - loss: 1.0529 - sparse_accuracy_ignoring_last_label: 0.6385
lr: 0.009892
Epoch 4/250
11127/11127 [==============================] - 7878s - loss: 1.0487 - sparse_accuracy_ignoring_last_label: 0.6407
lr: 0.009856
Epoch 5/250
11127/11127 [==============================] - 7915s - loss: 1.0411 - sparse_accuracy_ignoring_last_label: 0.6434
lr: 0.009820
Epoch 6/250
11127/11127 [==============================] - 7849s - loss: 1.0374 - sparse_accuracy_ignoring_last_label: 0.6447
lr: 0.009784
Epoch 7/250
11127/11127 [==============================] - 7843s - loss: 1.0358 - sparse_accuracy_ignoring_last_label: 0.6448
lr: 0.009748
Epoch 8/250
 6808/11127 [=================>............] - ETA: 3041s - loss: 1.0342 - sparse_accuracy_ignoring_last_label: 0.6447

Also, training is taking much longer than I expected, around 2 hours per epoch. Is that typical with the full 11k images from Pascal VOC + the Berkeley dataset? I'm running on a GTX 1080 with a batch size of 16, and the files are stored on an HDD rather than an SSD, though in theory Linux caches this sort of thing and the whole dataset could fit in my 48 GB of system RAM.
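To check whether reading from the HDD is actually the bottleneck, I might time the data generator on its own, without the model in the loop (a rough sketch; train_generator stands in for whatever generator train.py builds, so treat it as a placeholder):

    import time

    # Pull batches from the generator alone and time it. If this is close to
    # the per-step time Keras reports, disk I/O / preprocessing is the
    # bottleneck; if it is much faster, the slowness is on the GPU side.
    n_batches = 50
    start = time.time()
    for _ in range(n_batches):
        next(train_generator)  # placeholder name, not the repo's actual variable
    print("%.2f s per batch from the generator alone"
          % ((time.time() - start) / n_batches))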

ahundt (Collaborator, Author) commented Apr 20, 2017

For reference, that means these parameters (with SGD as the optimizer):

    model_name = 'AtrousFCN_Resnet50_16s'
    batch_size = 16
    batchnorm_momentum = 0.95
    epochs = 250
    lr_base = 0.01 * (float(batch_size) / 16)
    lr_power = 0.9
    resume_training = False
    if model_name == 'AtrousFCN_Resnet50_16s':  # string comparison should use ==, not 'is'
        weight_decay = 0.0001 / 2
    else:
        weight_decay = 1e-4
    classes = 21
    target_size = (320, 320)
    dataset = 'VOC2012_BERKELEY'
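The lr values printed in the log above look consistent with polynomial decay of lr_base over the 250 epochs; here is my reading of that schedule as a small sketch (an assumption about how the repo computes it, not a quote of its code):

    def poly_lr(epoch, lr_base=0.01, lr_power=0.9, epochs=250):
        # Polynomial decay: lr_base * (1 - epoch / epochs) ** lr_power.
        # Reproduces the logged values, e.g. poly_lr(1) ~= 0.009964 and
        # poly_lr(2) ~= 0.009928.
        return lr_base * (1.0 - float(epoch) / epochs) ** lr_power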

aurora95 (Owner) commented Apr 20, 2017

To be honest, I didn't test the new code...

But my old code running on an old Titan X takes about 600-700 s per epoch for AtrousFCN_Resnet50_16s with batch_size=16 and target_size=(320, 320) on the 11k dataset. Also, the accuracy after the first epoch should be around 0.78. So you must have a big problem somewhere...

ahundt (Collaborator, Author) commented Apr 20, 2017

Oh man, so silly... the branch on my laptop was different from the one on my training workstation, sorry about that. The numbers look like what you describe now that I've checked out the right branch and enabled SGD.

 1398/11127 [==>...........................] - ETA: 6660s - loss: 1.0014 - sparse_accuracy_ignoring_last_label: 0.7451

For some reason epochs are still super slow for me, but that's probably particular to my machine.
