
CosineLRScheduler has no t_mul parameter #9

Open
verigle opened this issue Nov 30, 2022 · 31 comments

@verigle commented Nov 30, 2022

Traceback (most recent call last):
  File "train.py", line 869, in <module>
    main(args)
  File "train.py", line 831, in main
    lr_scheduler = build_scheduler(lr_args, optimizer, len(train_dataloader))
  File "/media/localhost/E/projects/github/multi-modal/vision-language/DDCap/lr_scheduler.py", line 26, in build_scheduler
    lr_scheduler = CosineLRScheduler(
TypeError: __init__() got an unexpected keyword argument 't_mul'

Is this keyword argument wrong? What should the correct one be?

@buxiangzhiren (Owner)

Which version of timm are you using?

@buxiangzhiren (Owner)

CosineLRScheduler is imported from timm, and the error reports an unexpected extra keyword argument, so this should be a timm version mismatch.
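
For reference, newer timm releases renamed some of the cosine scheduler's keyword arguments (as far as I can tell, `t_mul` became `cycle_mul` and `decay_rate` became `cycle_decay`), so a version-tolerant sketch of the scheduler construction could look like this (not the repo's actual `lr_scheduler.py`; argument values are placeholders):

```python
import inspect
from timm.scheduler.cosine_lr import CosineLRScheduler

def build_cosine_scheduler(optimizer, num_steps, t_mul=1.0, lr_min=1e-6,
                           warmup_t=0, warmup_lr_init=1e-7):
    """Build timm's CosineLRScheduler across old and new timm versions (sketch)."""
    accepted = inspect.signature(CosineLRScheduler.__init__).parameters
    kwargs = dict(t_initial=num_steps, lr_min=lr_min,
                  warmup_t=warmup_t, warmup_lr_init=warmup_lr_init,
                  t_in_epochs=False)
    if "t_mul" in accepted:        # old API (timm ~0.3/0.4)
        kwargs["t_mul"] = t_mul
    elif "cycle_mul" in accepted:  # new API (later timm releases)
        kwargs["cycle_mul"] = t_mul
    return CosineLRScheduler(optimizer, **kwargs)
```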

@verigle (Author) commented Dec 1, 2022

I am using the latest timm (version timm==0.3.4 is incompatible):
timm 0.6.12

@buxiangzhiren (Owner)

Has it been solved? You can install 0.3.4, right?

@verigle (Author) commented Dec 3, 2022

Not solved yet. 0.3.4 can be installed, but it raises an error at runtime.

@verigle (Author) commented Dec 3, 2022

I switched the version to
timm 0.4.12

It runs now, but starting from epoch 5 all the losses became nan.

>>> Evaling epoch 5
caption_diff_vitb16_test: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [07:23<00:00,  5.61s/it]
loading annotations into memory...
0:00:00.765627
creating index...
index created!
Loading and preparing results...     
DONE (t=0.02s)
creating index...
index created!
tokenization...
PTBTokenizer tokenized 307085 tokens at 1389392.45 tokens per second.
PTBTokenizer tokenized 9999 tokens at 332602.31 tokens per second.
setting up scorers...
computing Bleu score...
{'testlen': 0, 'reflen': 42485, 'guess': [0, 0, 0, 0], 'correct': [0, 0, 0, 0]}
ratio: 2.3537719195009454e-20
Bleu_1: 0.000
Bleu_2: 0.000
Bleu_3: 0.000
Bleu_4: 0.000
computing METEOR score...
METEOR: 0.000
computing Rouge score...
ROUGE_L: 0.000
computing CIDEr score...
CIDEr: 0.000
computing SPICE score...
Parsing reference captions
Parsing test captions
SPICE evaluation took: 3.000 s
SPICE: 0.000
{'Bleu_1': 0.0, 'Bleu_2': 0.0, 'Bleu_3': 0.0, 'Bleu_4': 0.0, 'METEOR': 0.0, 'ROUGE_L': 0.0, 'CIDEr': 0.0, 'SPICE': 0.0}
{'vl_loss_0': nan, 'vl_loss_1': nan, 'vl_loss_2': nan, 'vl_loss_3': nan, 'vl_loss_4': nan, 'vl_loss_5': nan, 'vl_loss_6': nan, 'vl_loss_7': nan, 'vl_loss_8': nan, 'vl_loss_9': nan, 'vl_loss_10': nan, 'vl_loss_11': nan, 'vl_loss_12': nan, 'vl_loss_13': nan, 'vl_loss_14': nan, 'vl_loss_15': nan, 'vl_loss_16': nan, 'vl_loss_17': nan, 'vl_loss_18': nan, 'vl_loss_19': nan}

Could this be related to the timm version?

@buxiangzhiren (Owner) commented Dec 3, 2022

Just use 0.3.4. The error is raised from timm's helpers file; you only need to patch that.
[screenshot of the patch]
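
The screenshot is not reproduced here. The error most commonly reported for timm 0.3.x on recent PyTorch comes from the `torch._six` import in `timm/models/layers/helpers.py`; assuming that is what the screenshot shows, the usual patch is:

```python
# timm/models/layers/helpers.py (timm 0.3.x) -- commonly applied patch.
# The original file does `from torch._six import container_abcs`, which breaks
# on newer PyTorch because container_abcs was removed from torch._six.
import torch

TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])

if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
    from torch._six import container_abcs
else:
    import collections.abc as container_abcs
```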

@buxiangzhiren (Owner)

> I switched the version to timm 0.4.12. It runs now, but starting from epoch 5 all the losses became nan. [evaluation log with all-zero metrics and nan losses quoted above] Could this be related to the timm version?

If this still happens after fixing timm, it is probably an lr problem. What learning rate and batch size are you using now?

@verigle (Author) commented Dec 3, 2022

Both the learning rate and batch size are the defaults:
parser.add_argument('--bs', type=int, default=64)
parser.add_argument('--lr', type=float, default=2e-4)

@verigle (Author) commented Dec 3, 2022

Could you provide the hyperparameters you used for training and the training log?

@buxiangzhiren (Owner)

Then it should be a timm problem. I will send you the log.

@buxiangzhiren (Owner)

> Both the learning rate and batch size are the defaults (--bs 64, --lr 2e-4)

Are you running on eight GPUs?

@verigle (Author) commented Dec 3, 2022

I only have one GPU.

@buxiangzhiren (Owner)

Then your batch size is wrong here; the effective batch size should be 8 * 64 = 512.

@buxiangzhiren (Owner)

But one GPU probably cannot fit 512; you can try bs 256 and lr 1e-4.
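
For what it's worth, this suggestion matches the common linear-scaling rule of thumb (a heuristic, not something stated in this repo): scale the learning rate by the same factor as the batch size.

```python
base_bs, base_lr = 512, 2e-4          # default setup: 8 GPUs x bs 64
new_bs = 256
new_lr = base_lr * new_bs / base_bs   # -> 1e-4, matching the suggestion above
```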

@buxiangzhiren (Owner)

[two screenshots of the training hyperparameters]
These are the settings I used at the time with 4 GPUs and bs 128.

@buxiangzhiren (Owner)

output.log
This is the log, though the results are on the val dataset, and inference did not use image-free, which is equivalent to s = 0.

@verigle (Author) commented Dec 3, 2022

Right now one GPU just barely fits batch size 128; 256 will not fit.

By the way, why does the learning rate have to be adjusted along with the batch_size?

@buxiangzhiren (Owner)

For bs 128 you can try lr 5e-6 and 1e-4, or something like 7e-6; I am not sure how much this change affects performance.

@verigle (Author) commented Dec 3, 2022

OK, thanks.

@buxiangzhiren (Owner)

Also, what we found in our experiments is that adding adaptive layernorm makes training very sensitive to the lr: with bs 512, lr 3e-4 already goes to nan.

@verigle (Author) commented Dec 3, 2022

Have you tried gradient accumulation? Can gradient accumulation achieve the same effect as a large batch_size?

@buxiangzhiren (Owner)

Yes, gradient accumulation works.
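
A minimal, self-contained sketch of gradient accumulation in PyTorch (toy model and data, not this repo's training loop): accumulating over `accum_steps` micro-batches of 128 behaves roughly like one batch of 512, provided the loss is divided by `accum_steps` so the summed gradients match.

```python
import torch
from torch import nn

model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# toy micro-batches of size 128 standing in for the real dataloader
data = [(torch.randn(128, 16), torch.randn(128, 1)) for _ in range(8)]

accum_steps = 4                              # 4 x 128 ≈ effective batch size 512
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()          # scale so summed grads match one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```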

@buxiangzhiren (Owner)

But at the time we only tried it during pre-training; we have no results for training directly on COCO.

@verigle (Author) commented Dec 4, 2022

Is there a way to download the required data in advance, before training starts?



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/verigle/miniconda3/envs/DDCap/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/verigle/miniconda3/envs/DDCap/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/verigle/miniconda3/envs/DDCap/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/vocab.json (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))

It seems that when the network is unstable, training can also be interrupted like this at epoch 4.

@buxiangzhiren (Owner)

It should be a gpt2 issue. You can download gpt2 and then load it locally.

@buxiangzhiren (Owner)

It should be in the tokenizer, at "GPT2Tokenizer.from_pretrained()".
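
A sketch of loading the tokenizer from a local copy (the directory name `./gpt2_local` is just an example): download the gpt2 files once while the network is available, then point `from_pretrained` at the local directory so training never has to reach huggingface.co.

```python
from transformers import GPT2Tokenizer

# One-off step on a machine with network access (or copy vocab.json / merges.txt
# from https://huggingface.co/gpt2 into ./gpt2_local by hand):
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained("./gpt2_local")

# In the training code, load from the local directory instead of the hub:
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2_local")
```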

@verigle (Author) commented Dec 6, 2022

What accuracy did you get with 4 GPUs and batch size 128?

> [screenshots] These are the settings I used at the time with 4 GPUs and bs 128.

This is the result I got with one GPU, batch size 128, lr=1e-4. Is it very different from your result?

{'Bleu_1': 77.35163674762245, 'Bleu_2': 60.68409111177853, 'Bleu_3': 46.01830427570451, 'Bleu_4': 34.467262051066285, 'METEOR': 28.135904032357328, 'ROUGE_L': 57.21217470338099, 'CIDEr': 117.40167019143423, 'SPICE': 21.41155617820628}
{'vl_loss_0': 0.003462882013991475, 'vl_loss_1': 0.05225605517625809, 'vl_loss_2': 0.10187061131000519, 'vl_loss_3': 0.17999917268753052, 'vl_loss_4': 0.23890109360218048, 'vl_loss_5': 0.31856250762939453, 'vl_loss_6': 0.404144287109375, 'vl_loss_7': 0.5176785588264465, 'vl_loss_8': 0.6270703077316284, 'vl_loss_9': 0.74876868724823, 'vl_loss_10': 0.9183353185653687, 'vl_loss_11': 1.0719547271728516, 'vl_loss_12': 1.2795639038085938, 'vl_loss_13': 1.4924308061599731, 'vl_loss_14': 1.7697254419326782, 'vl_loss_15': 2.046462297439575, 'vl_loss_16': 2.389587879180908, 'vl_loss_17': 2.783905267715454, 'vl_loss_18': 3.212603807449341, 'vl_loss_19': 3.7237884998321533}

@buxiangzhiren (Owner)

Almost no difference; it is about 117 as well.

@ShyFoo commented Feb 10, 2023

> This is the result I got with one GPU, batch size 128, lr=1e-4. Is it very different from your result? [results quoted above]

Hi, could you share your code? I keep getting errors when running the author's code...
