
GradientAccumulator wrapper not working as expected #2

Closed
andreped opened this issue Jun 1, 2022 · 6 comments
Labels: bug (Something isn't working)

andreped commented Jun 1, 2022

In gradient accumulation, we update the weights only after a given number of iterations (k batches), in an ensemble-based manner - for instance by averaging the gradients calculated across those k batches and only then applying the update, simulating regular training with a larger batch size.

After running the benchmark described here, using:

  1. batch_size=32, accum_steps=1, epochs=3
  2. batch_size=8, accum_steps=4, epochs=12

We do not get the same results. It seems like the weights are updated for every batch, even though accum_steps > 1.

Both the original wrapper implementation GradientAccumulator and the Adam-based wrapper AdamAccumulate suffer from this.

Are we actually able to control when the optimizer updates the weights, or can we only compute the gradients and enforce the update ourselves?

Obviously, we could write our own training loop, but the whole point is to have a simple wrapper class that handles all of this for us.
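For reference, this is roughly what such a manual training loop would look like (a minimal sketch, assuming eager execution and that model, optimizer, loss_fn and a tf.data dataset are already defined; all names here are illustrative, not code from this repository):

import tensorflow as tf

accum_steps = 4  # number of batches to accumulate before applying an update

# one zero-initialised accumulator per trainable variable
accum_grads = [tf.Variable(tf.zeros_like(v), trainable=False)
               for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accum_grads, grads):
        acc.assign_add(g / accum_steps)  # running average over accum_steps batches
    # only apply (and reset) the accumulated gradients every accum_steps batches
    if (step + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        for acc in accum_grads:
            acc.assign(tf.zeros_like(acc))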

andreped commented Jun 1, 2022

Instructions on how to reproduce the issue have now been added here.

@andreped andreped added the bug Something isn't working label Jun 1, 2022

andreped commented Jun 1, 2022

Silly me. It is obviously wrong to run accum_steps times more epochs. The correct number of updates is already performed within a given epoch (this has now been corrected: c0d4f1b).

When I run the same number of epochs using the train_step overload approach (accum_opt=-1), I get almost identical results! Hence, I believe the GradientAccumulator implementation is close to ready for single-GPU scenarios.
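For context, the train_step overload approach boils down to something like the following (a simplified sketch only, not the exact benchmark code; class and attribute names are made up for illustration):

import tensorflow as tf

class AccumModel(tf.keras.Model):
    # accumulate gradients over accum_steps batches inside train_step
    def __init__(self, *args, accum_steps=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = accum_steps
        self._step = tf.Variable(0, dtype=tf.int64, trainable=False)
        self._accum = [tf.Variable(tf.zeros_like(v), trainable=False)
                       for v in self.trainable_variables]

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        for acc, g in zip(self._accum, grads):
            acc.assign_add(g / self.accum_steps)  # running average
        self._step.assign_add(1)

        def apply_and_reset():
            self.optimizer.apply_gradients(zip(self._accum, self.trainable_variables))
            for acc in self._accum:
                acc.assign(tf.zeros_like(acc))
            return tf.constant(True)

        # only touch the weights every accum_steps batches
        tf.cond(self._step % self.accum_steps == 0,
                apply_and_reset, lambda: tf.constant(False))

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

The model has to be constructed with the functional API (e.g. AccumModel(inputs, outputs, accum_steps=32)) so that the trainable variables already exist when the accumulators are created in __init__; after that it can be compiled and fitted as usual.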

Terminal output from some benchmarks added below (some prints removed for readability):

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt -1 --batchsize 1 --accum_steps 32 --epochs 3
Namespace(accum_opt=-1, accum_steps=32, batchsize=1, epochs=3)
Epoch 1/3
60000/60000 [==============================] - 56s 912us/step - loss: 0.2667 - sparse_categorical_accuracy: 0.9236 - val_loss: 0.1346 - val_sparse_categorical_accuracy: 0.9581
Epoch 2/3
60000/60000 [==============================] - 54s 902us/step - loss: 0.1167 - sparse_categorical_accuracy: 0.9656 - val_loss: 0.0992 - val_sparse_categorical_accuracy: 0.9700
Epoch 3/3
60000/60000 [==============================] - 61s 1ms/step - loss: 0.0802 - sparse_categorical_accuracy: 0.9758 - val_loss: 0.0874 - val_sparse_categorical_accuracy: 0.9725
10000/10000 [==============================] - 6s 623us/step - loss: 0.0874 - sparse_categorical_accuracy: 0.9725
[0.08742792904376984, 0.9725000262260437]

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt -1 --batchsize 32 --accum_steps 1 --epochs 3
Namespace(accum_opt=-1, accum_steps=1, batchsize=32, epochs=3)
Epoch 1/3
1875/1875 [==============================] - 5s 2ms/step - loss: 0.2659 - sparse_categorical_accuracy: 0.9233 - val_loss: 0.1337 - val_sparse_categorical_accuracy: 0.9585
Epoch 2/3
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1155 - sparse_categorical_accuracy: 0.9662 - val_loss: 0.0974 - val_sparse_categorical_accuracy: 0.9708
Epoch 3/3
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0798 - sparse_categorical_accuracy: 0.9754 - val_loss: 0.0850 - val_sparse_categorical_accuracy: 0.9738
313/313 [==============================] - 0s 889us/step - loss: 0.0850 - sparse_categorical_accuracy: 0.9738
[0.08499550819396973, 0.973800003528595]

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt -1 --batchsize 1 --accum_steps 1 --epochs 3
Namespace(accum_opt=-1, accum_steps=1, batchsize=1, epochs=3)
2022-06-01 21:04:02.166575: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: 
Epoch 1/3
60000/60000 [==============================] - 74s 1ms/step - loss: 0.2108 - sparse_categorical_accuracy: 0.9384 - val_loss: 0.1428 - val_sparse_categorical_accuracy: 0.9607
Epoch 2/3
60000/60000 [==============================] - 70s 1ms/step - loss: 0.1263 - sparse_categorical_accuracy: 0.9663 - val_loss: 0.1531 - val_sparse_categorical_accuracy: 0.9664
Epoch 3/3
60000/60000 [==============================] - 70s 1ms/step - loss: 0.1067 - sparse_categorical_accuracy: 0.9741 - val_loss: 0.1801 - val_sparse_categorical_accuracy: 0.9611
10000/10000 [==============================] - 6s 604us/step - loss: 0.1801 - sparse_categorical_accuracy: 0.9611
[0.18014205992221832, 0.9610999822616577]

Note that using bs=1 & acs=32 actually improves results compared to bs=1 & acs=1. Hence, performing gradient accumulation actually improves performance here.

Doing bs=1 & acs=32 vs bs=32 & acs=1 produces almost identical results. Theoretically, they should be identical. As I ran this on GPU and handled all the random-seed issues, there is no obvious explanation for why they differ. I therefore believe there is a minor bug somewhere - likely the last update of an epoch not being performed, or something like that.

Lastly, note that training with acs=32 & bs=1 is a lot slower than acs=1 & bs=32. This is expected, and it is the main drawback of using accumulated gradients. In return, it makes it possible to use a much larger effective batch size than we normally could! If possible, one can increase the batch size and reduce the accumulation steps, e.g., bs=8 & acs=4, which should yield identical results to the runs above (except bs=1 & acs=1, obviously).
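For reference, the different configurations can be compared through their effective batch size, which is simply batch_size * accum_steps:

bs=1,  acs=32  ->  1 * 32 = 32
bs=8,  acs=4   ->  8 * 4  = 32
bs=32, acs=1   ->  32 * 1 = 32
bs=1,  acs=1   ->  1 * 1  = 1

All but the last should therefore, in theory, behave like regular training with batch size 32.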

@andreped andreped self-assigned this Jun 1, 2022

andreped commented Jun 1, 2022

However, interestingly enough, I don't get identical results using the wrapper approach (accum_opt=2):

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt 2 --batchsize 1 --accum_steps 32 --epochs 3
Namespace(accum_opt=2, accum_steps=32, batchsize=1, epochs=3)
Epoch 1/3
60000/60000 [==============================] - 67s 1ms/step - loss: 0.2396 - sparse_categorical_accuracy: 0.9293 - val_loss: 0.1308 - val_sparse_categorical_accuracy: 0.9624
Epoch 2/3
60000/60000 [==============================] - 62s 1ms/step - loss: 0.1344 - sparse_categorical_accuracy: 0.9636 - val_loss: 0.1394 - val_sparse_categorical_accuracy: 0.9663
Epoch 3/3
60000/60000 [==============================] - 62s 1ms/step - loss: 0.1144 - sparse_categorical_accuracy: 0.9706 - val_loss: 0.1338 - val_sparse_categorical_accuracy: 0.9689
10000/10000 [==============================] - 6s 600us/step - loss: 0.1338 - sparse_categorical_accuracy: 0.9689
[0.13378407061100006, 0.9689000248908997]

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt 2 --batchsize 32 --accum_steps 1 --epochs 3
Namespace(accum_opt=2, accum_steps=1, batchsize=32, epochs=3)
Epoch 1/3
1875/1875 [==============================] - 6s 2ms/step - loss: 0.2659 - sparse_categorical_accuracy: 0.9233 - val_loss: 0.1337 - val_sparse_categorical_accuracy: 0.9585
Epoch 2/3
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1155 - sparse_categorical_accuracy: 0.9662 - val_loss: 0.0974 - val_sparse_categorical_accuracy: 0.9708
Epoch 3/3
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0798 - sparse_categorical_accuracy: 0.9754 - val_loss: 0.0850 - val_sparse_categorical_accuracy: 0.9738
313/313 [==============================] - 0s 895us/step - loss: 0.0850 - sparse_categorical_accuracy: 0.9738
[0.08499550819396973, 0.973800003528595]

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt 2 --batchsize 1 --accum_steps 1 --epochs 3
Namespace(accum_opt=2, accum_steps=1, batchsize=1, epochs=3)
Epoch 1/3
60000/60000 [==============================] - 71s 1ms/step - loss: 0.2108 - sparse_categorical_accuracy: 0.9384 - val_loss: 0.1428 - val_sparse_categorical_accuracy: 0.9607
Epoch 2/3
60000/60000 [==============================] - 67s 1ms/step - loss: 0.1263 - sparse_categorical_accuracy: 0.9663 - val_loss: 0.1531 - val_sparse_categorical_accuracy: 0.9664
Epoch 3/3
60000/60000 [==============================] - 67s 1ms/step - loss: 0.1067 - sparse_categorical_accuracy: 0.9741 - val_loss: 0.1801 - val_sparse_categorical_accuracy: 0.9611
10000/10000 [==============================] - 6s 601us/step - loss: 0.1801 - sparse_categorical_accuracy: 0.9611
[0.18014205992221832, 0.9610999822616577]

Naturally, the bs=32 & acs=1 results are identical between the two accum_opt options (-1 vs 2), but that is not surprising: setting acs=1 essentially disables gradient accumulation, so this is just the usual behaviour of regular optimization (e.g., using Adam/SGD).

Performance is also a lot worse using the GA wrapper.
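For comparison, the idea behind a wrapper is roughly the following (a conceptual sketch only, not the actual GradientAccumulator/AdamAccumulate implementation; it assumes a custom eager training loop calling apply_gradients directly). One well-known pitfall for wrapper-style implementations is that Python-side counters and if-statements like the ones below get frozen at trace time once Keras wraps train_step in a tf.function, so inside model.fit() the bookkeeping has to be done with tf.Variable and tf.cond instead:

import tensorflow as tf

class AccumOptimizerSketch:
    # buffer gradients and delegate to the wrapped optimizer
    # only every accum_steps calls to apply_gradients
    def __init__(self, optimizer, accum_steps=4):
        self.optimizer = optimizer
        self.accum_steps = accum_steps
        self._step = 0
        self._buffer = None

    def apply_gradients(self, grads_and_vars):
        grads_and_vars = list(grads_and_vars)
        if self._buffer is None:
            self._buffer = [tf.Variable(tf.zeros_like(g), trainable=False)
                            for g, _ in grads_and_vars]
        for buf, (g, _) in zip(self._buffer, grads_and_vars):
            buf.assign_add(g / self.accum_steps)  # running average
        self._step += 1
        if self._step % self.accum_steps == 0:
            self.optimizer.apply_gradients(
                [(buf.read_value(), v) for buf, (_, v) in zip(self._buffer, grads_and_vars)])
            for buf in self._buffer:
                buf.assign(tf.zeros_like(buf))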

@andreped andreped changed the title Updates happen too often? GradientAccumulator wrapper not working as expected Jun 1, 2022

andreped commented Jun 1, 2022

Just ran a new experiment to further test the train_step overload approach:

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt -1 --batchsize 32 --accum_steps 8 --epochs 24
Namespace(accum_opt=-1, accum_steps=8, batchsize=32, epochs=24)
Epoch 1/24
1875/1875 [==============================] - 5s 2ms/step - loss: 0.4577 - sparse_categorical_accuracy: 0.8751 - val_loss: 0.2493 - val_sparse_categorical_accuracy: 0.9287
Epoch 2/24
1875/1875 [==============================] - 3s 1ms/step - loss: 0.2141 - sparse_categorical_accuracy: 0.9391 - val_loss: 0.1825 - val_sparse_categorical_accuracy: 0.9461
Epoch 3/24
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1566 - sparse_categorical_accuracy: 0.9561 - val_loss: 0.1425 - val_sparse_categorical_accuracy: 0.9591

(venv) PS C:\Users\47955\workspace\GradientAccumulator> python .\benchmark.py --accum_opt -1 --batchsize 256 --accum_steps 1 --epochs 3
Namespace(accum_opt=-1, accum_steps=1, batchsize=256, epochs=3)
Epoch 1/3
235/235 [==============================] - 2s 4ms/step - loss: 0.4580 - sparse_categorical_accuracy: 0.8748 - val_loss: 0.2495 - val_sparse_categorical_accuracy: 0.9286
Epoch 2/3
235/235 [==============================] - 1s 3ms/step - loss: 0.2143 - sparse_categorical_accuracy: 0.9393 - val_loss: 0.1829 - val_sparse_categorical_accuracy: 0.9464
Epoch 3/3
235/235 [==============================] - 1s 2ms/step - loss: 0.1570 - sparse_categorical_accuracy: 0.9560 - val_loss: 0.1423 - val_sparse_categorical_accuracy: 0.9591

We get close to identical results using bs=256 & acs=1 vs bs=32 & acs=8, which is the expected behaviour! However, there is a minor difference. Here it is definitely due to the number of updates not being the same in each experiment, as the total number of samples (N=60000) is not a multiple of 256 (60000 % 256 != 0).
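For completeness, the per-epoch update counts: with bs=256 & acs=1 there are ceil(60000 / 256) = 235 optimizer updates, where the last batch only contains 60000 - 234 * 256 = 96 samples, while with bs=32 & acs=8 there are floor(1875 / 8) = 234 full accumulated updates, leaving 3 batches at the end of the epoch whose gradients may or may not be applied depending on how the remainder is handled.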

We can then conclude that the train_step overload approach is a viable option. Sadly, the wrapper approach has not had the same success (yet).

andreped commented Jun 1, 2022

Started discussion #3 if anyone is interested in discussing this further. Will keep this issue open for now for new users.

andreped commented Jun 3, 2022

Fixed in 5f1a703
