slow training progress #17

Closed · ahundt opened this issue Apr 20, 2017 · 3 comments
ahundt (Collaborator) commented Apr 20, 2017

I'm running the current master and I'm not seeing the performance described in #4; perhaps something is off with my converted dataset?


11127/11127 [==============================] - 7896s - loss: 1.3891 - sparse_accuracy_ignoring_last_label: 0.6165
lr: 0.009964
Epoch 2/250
11127/11127 [==============================] - 7972s - loss: 1.0751 - sparse_accuracy_ignoring_last_label: 0.6326
lr: 0.009928
Epoch 3/250
11127/11127 [==============================] - 7937s - loss: 1.0529 - sparse_accuracy_ignoring_last_label: 0.6385
lr: 0.009892
Epoch 4/250
11127/11127 [==============================] - 7878s - loss: 1.0487 - sparse_accuracy_ignoring_last_label: 0.6407
lr: 0.009856
Epoch 5/250
11127/11127 [==============================] - 7915s - loss: 1.0411 - sparse_accuracy_ignoring_last_label: 0.6434
lr: 0.009820
Epoch 6/250
11127/11127 [==============================] - 7849s - loss: 1.0374 - sparse_accuracy_ignoring_last_label: 0.6447
lr: 0.009784
Epoch 7/250
11127/11127 [==============================] - 7843s - loss: 1.0358 - sparse_accuracy_ignoring_last_label: 0.6448
lr: 0.009748
Epoch 8/250
 6808/11127 [=================>............] - ETA: 3041s - loss: 1.0342 - sparse_accuracy_ignoring_last_label: 0.6447

Also, training is taking much longer than I expected, around 2 hours per epoch. Is that typical with the full 11k images from Pascal VOC + the Berkeley dataset? I'm running on a GTX 1080 with a batch size of 16, and the files are stored on an HDD rather than an SSD, though in theory Linux caches this sort of thing and the whole dataset could fit in my 48 GB of system RAM.
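To check whether reading from the HDD is actually the bottleneck, I might time the data generator on its own, without the model in the loop (a rough sketch; train_generator stands in for whatever generator train.py builds, so treat it as a placeholder):

    import time

    # Pull batches from the generator alone and time it. If this is close to
    # the per-step time Keras reports, disk I/O / preprocessing is the
    # bottleneck; if it is much faster, the slowness is on the GPU side.
    n_batches = 50
    start = time.time()
    for _ in range(n_batches):
        next(train_generator)  # placeholder name, not the repo's actual variable
    print("%.2f s per batch from the generator alone"
          % ((time.time() - start) / n_batches))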

ahundt (Collaborator, Author) commented Apr 20, 2017

For reference, that means these parameters (with SGD as the optimizer):

    model_name = 'AtrousFCN_Resnet50_16s'
    batch_size = 16
    batchnorm_momentum = 0.95
    epochs = 250
    lr_base = 0.01 * (float(batch_size) / 16)
    lr_power = 0.9
    resume_training = False
    if model_name == 'AtrousFCN_Resnet50_16s':  # string comparison should use ==, not 'is'
        weight_decay = 0.0001 / 2
    else:
        weight_decay = 1e-4
    classes = 21
    target_size = (320, 320)
    dataset = 'VOC2012_BERKELEY'
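The lr values printed in the log above look consistent with polynomial decay of lr_base over the 250 epochs; here is my reading of that schedule as a small sketch (an assumption about how the repo computes it, not a quote of its code):

    def poly_lr(epoch, lr_base=0.01, lr_power=0.9, epochs=250):
        # Polynomial decay: lr_base * (1 - epoch / epochs) ** lr_power.
        # Reproduces the logged values, e.g. poly_lr(1) ~= 0.009964 and
        # poly_lr(2) ~= 0.009928.
        return lr_base * (1.0 - float(epoch) / epochs) ** lr_power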

aurora95 (Owner) commented Apr 20, 2017

To be honest, I didn't test the new code...

But my old code running on an old Titan X takes about 600-700 s per epoch for AtrousFCN_Resnet50_16s with batch_size=16 and target_size=(320, 320) on the 11k dataset. Also, the accuracy after the first epoch should be around 0.78. So you must have a big problem somewhere...

ahundt (Collaborator, Author) commented Apr 20, 2017

Oh man, so silly... the branch on my laptop was different from the one on my training workstation, sorry about that. The numbers look like what you describe now that I've checked out the right branch and enabled SGD.

 1398/11127 [==>...........................] - ETA: 6660s - loss: 1.0014 - sparse_accuracy_ignoring_last_label: 0.7451

For some reason epochs are still super slow for me, but that's probably particular to my machine.
