bigscience-workshop · stas00 · Oct 10, 2021 · Oct 7, 2021 · Oct 8, 2021 · Oct 8, 2021
diff --git a/examples/curriculum_learning/README.md b/examples/curriculum_learning/README.md
@@ -9,7 +9,7 @@ Because CL changes length of each sequence/sample during training, it is very ha
 
 # Token-based LR decay
 
-Again because CL changes the number of tokens per batch, in our [paper](https://arxiv.org/abs/2108.06084) Appendix A.2 we show that it is also necessary to change the LR decay to token-based (to avoid decaying LR too fast). Thus we add a `--lr-decay-tokens` which will be the number of LR decay tokens. If previously you were using `--lr-decay-samples`, you can calculate your `--lr-decay-tokens` simply by multiplying the former by full seqlen (e.g. 2K for GPT-3). If `--lr-decay-tokens` is given, it will override `--lr-decay-samples` so you can keep both in the script. For LR warmup we don't change it to token-based, because doing so for CL means slowing down the LR warmup, which is both unnecessary and harmful.
+Again because CL changes the number of tokens per batch, in our [paper](https://arxiv.org/abs/2108.06084) Appendix A.2 we show that it is also necessary to change the LR decay to token-based (to avoid decaying LR too fast). Thus we add a `--lr-decay-tokens` which will be the number of LR decay tokens. If previously you were using `--lr-decay-samples`, you can calculate your `--lr-decay-tokens` simply by multiplying the former by full seqlen (e.g. 2K for GPT-3). Then you need to replace `--lr-decay-samples` with `--lr-decay-tokens` in your script. For LR warmup we don't change it to token-based, because doing so for CL means slowing down the LR warmup, which is both unnecessary and harmful.
 
 # Token-based tensorboard
 

diff --git a/examples/curriculum_learning/pretrain_gpt_cl.sh b/examples/curriculum_learning/pretrain_gpt_cl.sh
@@ -56,7 +56,6 @@ megatron_options=" \
         --adam-beta2 0.95 \
         --tensor-model-parallel-size ${MP_SIZE} \
         --init-method-std 0.014 \
-        --lr-decay-samples ${LR_DECAY_SAMPLES} \
         --lr-decay-tokens ${LR_DECAY_TOKENS} \
         --lr-warmup-samples ${LR_WARMUP_SAMPLES} \
         --micro-batch-size ${MICRO_BATCH_SIZE} \
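
For anyone updating their own launch script, the conversion described in the README (multiply the old `--lr-decay-samples` value by the full sequence length) might be sketched as below. The concrete values and the `SEQ_LEN` variable name are assumptions for illustration only; `LR_DECAY_SAMPLES` and `LR_DECAY_TOKENS` follow the names used in `pretrain_gpt_cl.sh`.

```shell
#!/bin/bash
# Illustrative values only (assumptions); substitute the ones from your own run.
SEQ_LEN=2048                # full sequence length, e.g. 2K for GPT-3
LR_DECAY_SAMPLES=1000000    # value previously passed to --lr-decay-samples

# Token-based decay budget = sample-based budget x full sequence length.
LR_DECAY_TOKENS=$((LR_DECAY_SAMPLES * SEQ_LEN))

echo "--lr-decay-tokens ${LR_DECAY_TOKENS}"
```

With these example numbers the script prints `--lr-decay-tokens 2048000000`. The warmup flag (`--lr-warmup-samples`) stays sample-based and needs no conversion.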