
CosineLRScheduler has no t_mul parameter #9

Open
verigle opened this issue Nov 30, 2022 · 31 comments

@verigle commented Nov 30, 2022

Traceback (most recent call last):
  File "train.py", line 869, in <module>
    main(args)
  File "train.py", line 831, in main
    lr_scheduler = build_scheduler(lr_args, optimizer, len(train_dataloader))
  File "/media/localhost/E/projects/github/multi-modal/vision-language/DDCap/lr_scheduler.py", line 26, in build_scheduler
    lr_scheduler = CosineLRScheduler(
TypeError: __init__() got an unexpected keyword argument 't_mul'

Is this keyword argument wrong? What should the correct one be?

@buxiangzhiren (Owner)

Which version of timm are you using?

@buxiangzhiren (Owner)

CosineLRScheduler is imported from timm, and the error reports an unexpected extra keyword argument, so this should be a timm version mismatch.
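
For reference, newer timm releases renamed some of the cosine scheduler's keyword arguments (as far as I can tell, `t_mul` became `cycle_mul` and `decay_rate` became `cycle_decay`), so a version-tolerant sketch of the scheduler construction could look like this (not the repo's actual `lr_scheduler.py`; argument values are placeholders):

```python
import inspect
from timm.scheduler.cosine_lr import CosineLRScheduler

def build_cosine_scheduler(optimizer, num_steps, t_mul=1.0, lr_min=1e-6,
                           warmup_t=0, warmup_lr_init=1e-7):
    """Build timm's CosineLRScheduler across old and new timm versions (sketch)."""
    accepted = inspect.signature(CosineLRScheduler.__init__).parameters
    kwargs = dict(t_initial=num_steps, lr_min=lr_min,
                  warmup_t=warmup_t, warmup_lr_init=warmup_lr_init,
                  t_in_epochs=False)
    if "t_mul" in accepted:        # old API (timm ~0.3/0.4)
        kwargs["t_mul"] = t_mul
    elif "cycle_mul" in accepted:  # new API (later timm releases)
        kwargs["cycle_mul"] = t_mul
    return CosineLRScheduler(optimizer, **kwargs)
```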

@verigle (Author) commented Dec 1, 2022

I am using the latest timm (version timm==0.3.4 is incompatible):
timm 0.6.12

@buxiangzhiren (Owner)

Has it been solved? You can install 0.3.4, right?

@verigle (Author) commented Dec 3, 2022

Not solved yet. 0.3.4 can be installed, but it raises an error at runtime.

@verigle (Author) commented Dec 3, 2022

I switched the version to
timm 0.4.12

It runs now, but starting from epoch 5 all the losses became nan.

>>> Evaling epoch 5
caption_diff_vitb16_test: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [07:23<00:00,  5.61s/it]
loading annotations into memory...
0:00:00.765627
creating index...
index created!
Loading and preparing results...     
DONE (t=0.02s)
creating index...
index created!
tokenization...
PTBTokenizer tokenized 307085 tokens at 1389392.45 tokens per second.
PTBTokenizer tokenized 9999 tokens at 332602.31 tokens per second.
setting up scorers...
computing Bleu score...
{'testlen': 0, 'reflen': 42485, 'guess': [0, 0, 0, 0], 'correct': [0, 0, 0, 0]}
ratio: 2.3537719195009454e-20
Bleu_1: 0.000
Bleu_2: 0.000
Bleu_3: 0.000
Bleu_4: 0.000
computing METEOR score...
METEOR: 0.000
computing Rouge score...
ROUGE_L: 0.000
computing CIDEr score...
CIDEr: 0.000
computing SPICE score...
Parsing reference captions
Parsing test captions
SPICE evaluation took: 3.000 s
SPICE: 0.000
{'Bleu_1': 0.0, 'Bleu_2': 0.0, 'Bleu_3': 0.0, 'Bleu_4': 0.0, 'METEOR': 0.0, 'ROUGE_L': 0.0, 'CIDEr': 0.0, 'SPICE': 0.0}
{'vl_loss_0': nan, 'vl_loss_1': nan, 'vl_loss_2': nan, 'vl_loss_3': nan, 'vl_loss_4': nan, 'vl_loss_5': nan, 'vl_loss_6': nan, 'vl_loss_7': nan, 'vl_loss_8': nan, 'vl_loss_9': nan, 'vl_loss_10': nan, 'vl_loss_11': nan, 'vl_loss_12': nan, 'vl_loss_13': nan, 'vl_loss_14': nan, 'vl_loss_15': nan, 'vl_loss_16': nan, 'vl_loss_17': nan, 'vl_loss_18': nan, 'vl_loss_19': nan}

Could this be related to the timm version?

@buxiangzhiren (Owner) commented Dec 3, 2022

Just use 0.3.4. The error is raised from timm's helpers file; you only need to patch that.
[screenshot of the patch]
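
The screenshot is not reproduced here. The error most commonly reported for timm 0.3.x on recent PyTorch comes from the `torch._six` import in `timm/models/layers/helpers.py`; assuming that is what the screenshot shows, the usual patch is:

```python
# timm/models/layers/helpers.py (timm 0.3.x) -- commonly applied patch.
# The original file does `from torch._six import container_abcs`, which breaks
# on newer PyTorch because container_abcs was removed from torch._six.
import torch

TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])

if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
    from torch._six import container_abcs
else:
    import collections.abc as container_abcs
```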

@buxiangzhiren (Owner)

> I switched the version to timm 0.4.12. It runs now, but starting from epoch 5 all the losses became nan. [evaluation log with all-zero metrics and nan losses quoted above] Could this be related to the timm version?

If this still happens after fixing timm, it is probably an lr problem. What learning rate and batch size are you using now?

@verigle (Author) commented Dec 3, 2022

Both the learning rate and batch size are the defaults:
parser.add_argument('--bs', type=int, default=64)
parser.add_argument('--lr', type=float, default=2e-4)

@verigle (Author) commented Dec 3, 2022

Could you provide the hyperparameters you used for training and the training log?

@buxiangzhiren (Owner)

Then it should be a timm problem. I will send you the log.

@buxiangzhiren (Owner)

> Both the learning rate and batch size are the defaults (--bs 64, --lr 2e-4)

Are you running on eight GPUs?

@verigle (Author) commented Dec 3, 2022

I only have one GPU.

@buxiangzhiren (Owner)

Then your batch size is wrong here; the effective batch size should be 8 * 64 = 512.

@buxiangzhiren (Owner)

But one GPU probably cannot fit 512; you can try bs 256 and lr 1e-4.
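
For what it's worth, this suggestion matches the common linear-scaling rule of thumb (a heuristic, not something stated in this repo): scale the learning rate by the same factor as the batch size.

```python
base_bs, base_lr = 512, 2e-4          # default setup: 8 GPUs x bs 64
new_bs = 256
new_lr = base_lr * new_bs / base_bs   # -> 1e-4, matching the suggestion above
```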

@buxiangzhiren (Owner)

[two screenshots of the training hyperparameters]
These are the settings I used at the time with 4 GPUs and bs 128.

@buxiangzhiren (Owner)

output.log
This is the log, though the results are on the val dataset, and inference did not use image-free, which is equivalent to s = 0.

@verigle (Author) commented Dec 3, 2022

Right now one GPU just barely fits batch size 128; 256 will not fit.

By the way, why does the learning rate have to be adjusted along with the batch_size?

@buxiangzhiren (Owner)

For bs 128 you can try lr 5e-6 and 1e-4, or something like 7e-6; I am not sure how much this change affects performance.

@verigle (Author) commented Dec 3, 2022

OK, thanks.

@buxiangzhiren (Owner)

Also, what we found in our experiments is that adding adaptive layernorm makes training very sensitive to the lr: with bs 512, lr 3e-4 already goes to nan.

@verigle (Author) commented Dec 3, 2022

Have you tried gradient accumulation? Can gradient accumulation achieve the same effect as a large batch_size?

@buxiangzhiren (Owner)

Yes, gradient accumulation works.
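
A minimal, self-contained sketch of gradient accumulation in PyTorch (toy model and data, not this repo's training loop): accumulating over `accum_steps` micro-batches of 128 behaves roughly like one batch of 512, provided the loss is divided by `accum_steps` so the summed gradients match.

```python
import torch
from torch import nn

model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# toy micro-batches of size 128 standing in for the real dataloader
data = [(torch.randn(128, 16), torch.randn(128, 1)) for _ in range(8)]

accum_steps = 4                              # 4 x 128 ≈ effective batch size 512
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()          # scale so summed grads match one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```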

@buxiangzhiren (Owner)

But at the time we only tried it during pre-training; we have no results for training directly on COCO.

@verigle (Author) commented Dec 4, 2022

Is there a way to download the required data in advance, before training starts?



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/verigle/miniconda3/envs/DDCap/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/verigle/miniconda3/envs/DDCap/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/verigle/miniconda3/envs/DDCap/lib/python3.8/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/vocab.json (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))

It seems that when the network is unstable, training can also be interrupted like this at epoch 4.

@buxiangzhiren (Owner)

It should be a gpt2 issue. You can download gpt2 and then load it locally.

@buxiangzhiren (Owner)

It should be in the tokenizer, at "GPT2Tokenizer.from_pretrained()".
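
A sketch of loading the tokenizer from a local copy (the directory name `./gpt2_local` is just an example): download the gpt2 files once while the network is available, then point `from_pretrained` at the local directory so training never has to reach huggingface.co.

```python
from transformers import GPT2Tokenizer

# One-off step on a machine with network access (or copy vocab.json / merges.txt
# from https://huggingface.co/gpt2 into ./gpt2_local by hand):
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.save_pretrained("./gpt2_local")

# In the training code, load from the local directory instead of the hub:
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2_local")
```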

@verigle (Author) commented Dec 6, 2022

What accuracy did you get with 4 GPUs and batch size 128?

> [screenshots] These are the settings I used at the time with 4 GPUs and bs 128.

This is the result I got with one GPU, batch size 128, lr=1e-4. Is it very different from your result?

{'Bleu_1': 77.35163674762245, 'Bleu_2': 60.68409111177853, 'Bleu_3': 46.01830427570451, 'Bleu_4': 34.467262051066285, 'METEOR': 28.135904032357328, 'ROUGE_L': 57.21217470338099, 'CIDEr': 117.40167019143423, 'SPICE': 21.41155617820628}
{'vl_loss_0': 0.003462882013991475, 'vl_loss_1': 0.05225605517625809, 'vl_loss_2': 0.10187061131000519, 'vl_loss_3': 0.17999917268753052, 'vl_loss_4': 0.23890109360218048, 'vl_loss_5': 0.31856250762939453, 'vl_loss_6': 0.404144287109375, 'vl_loss_7': 0.5176785588264465, 'vl_loss_8': 0.6270703077316284, 'vl_loss_9': 0.74876868724823, 'vl_loss_10': 0.9183353185653687, 'vl_loss_11': 1.0719547271728516, 'vl_loss_12': 1.2795639038085938, 'vl_loss_13': 1.4924308061599731, 'vl_loss_14': 1.7697254419326782, 'vl_loss_15': 2.046462297439575, 'vl_loss_16': 2.389587879180908, 'vl_loss_17': 2.783905267715454, 'vl_loss_18': 3.212603807449341, 'vl_loss_19': 3.7237884998321533}

@buxiangzhiren (Owner)

Almost no difference; it is about 117 as well.

@ShyFoo commented Feb 10, 2023

> This is the result I got with one GPU, batch size 128, lr=1e-4. Is it very different from your result? [results quoted above]

Hi, could you share your code? I keep getting errors when running the author's code...
