
Reproducing the event detection task #5

Closed · yc1999 opened this issue Dec 17, 2021 · 6 comments

@yc1999

yc1999 commented Dec 17, 2021

Hi~

  1. Have you run the event detection code on multiple GPUs? Running with 4 GPUs, I get the error below (a minimal repro of the failing reshape follows at the end of this comment):
Training:   0%|                                                                                                                                                               | 0/1159 [00:22<?, ?it/s]
Traceback (most recent call last):
  File "run_trigger_extraction.py", line 405, in <module>
    main()
  File "run_trigger_extraction.py", line 378, in main
    train(args, model, processor)
  File "run_trigger_extraction.py", line 243, in train
    pred_sub_heads, pred_sub_tails = model(data, add_label_info=add_label_info)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/project/LEAR/models/model_event.py", line 635, in forward
    fused_results = self.label_fusing_layer(
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/project/LEAR/utils/model_utils.py", line 320, in forward
    return self.get_fused_feature_with_attn(token_embs, label_embs, input_mask, label_input_mask, return_scores=return_scores)
  File "/home/yc21/project/LEAR/utils/model_utils.py", line 504, in get_fused_feature_with_attn
    scores = torch.matmul(token_feature_fc, label_feature_t).view(
RuntimeError: shape '[4, 48, 33, -1]' is invalid for input of size 160512
  2. Running on a single GPU with batch_size=2 and gradient_accumulation_step=16, the code runs, but the training loss converges to around 6 and the dev F1 stays at 0. I don't know whether this is specific to my setup or whether others hit it too. Has anyone reproduced the reported results?

Thanks, everyone!
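
For reference, the failing reshape can be reproduced in isolation using only the numbers from the traceback; this is just an illustration of the divisibility constraint that view() enforces, not the repo's code:

```python
import torch

# The target shape [4, 48, 33, -1] requires the element count to be a
# multiple of 4 * 48 * 33 = 6336, but 160512 / 6336 is not an integer,
# so view() raises exactly the error shown in the traceback.
flat = torch.empty(160512)
try:
    flat.view(4, 48, 33, -1)
except RuntimeError as e:
    print(e)  # shape '[4, 48, 33, -1]' is invalid for input of size 160512
```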

@Akeepers
Owner

  1. As I recall, I wrote the multi-GPU code when the GPUs happened to be free. It uses DP (DataParallel), so it requires padding_to_max (a sketch of the idea follows this list). Later there were many more experiments and fewer free GPUs, so I never switched it to DDP and mainly relied on gradient accumulation.

  2. That result is wrong; I suggest checking your setup. In particular, note this: "Note: The thunlp has updated the repo HMEAE recently, which causing the mismatch of data. Make sure you use the earlier version for ED task."
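
A minimal sketch of what padding_to_max means here (a hypothetical collate helper, not the repo's actual code): with nn.DataParallel each replica receives a slice of the batch along dim 0, so every sequence must already be padded to one fixed max length before the forward pass; otherwise hard-coded reshapes like the view() above can fail.

```python
import torch

def pad_to_max(input_ids_list, max_seq_length=48, pad_id=0):
    """Pad/truncate every sequence to the same fixed length (hypothetical helper)."""
    padded = []
    for ids in input_ids_list:
        ids = ids[:max_seq_length]
        pad = torch.full((max_seq_length - len(ids),), pad_id, dtype=ids.dtype)
        padded.append(torch.cat([ids, pad]))
    return torch.stack(padded)

# Three sequences of different lengths all come out with the same shape,
# so each DataParallel replica sees the sequence length it expects.
batch = pad_to_max([torch.arange(5), torch.arange(30), torch.arange(48)])
print(batch.shape)  # torch.Size([3, 48])
```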

@huanghuidmml

The gain on event detection is not large. With the old HMEAE data split, plain BIO tagging already reaches 81+, sometimes even 82.

@Akeepers
Owner

@huanghuidmml I have never gotten results that high myself.

@yc1999
Author

yc1999 commented Jan 10, 2022

The cause of the second problem was that I had set --task_layer_lr to 2e-4 when it should be 20, for the following reason:

optimizer_grouped_parameters = [
    {"params": [p for n, p in bert_parameters if not any(nd in n for nd in no_decay)],
     "weight_decay": args.weight_decay, "lr": args.learning_rate},
    {"params": [p for n, p in bert_parameters if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0, "lr": args.learning_rate},
    {"params": [p for n, p in first_start_params if not any(nd in n for nd in no_decay)],
     "weight_decay": args.weight_decay, "lr": args.learning_rate * args.task_layer_lr},
    {"params": [p for n, p in first_start_params if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0, "lr": args.learning_rate * args.task_layer_lr},
    {"params": [p for n, p in first_end_params if not any(nd in n for nd in no_decay)],
     "weight_decay": args.weight_decay, "lr": args.learning_rate * args.task_layer_lr},
    {"params": [p for n, p in first_end_params if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0, "lr": args.learning_rate * args.task_layer_lr},
]
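
In these parameter groups, task_layer_lr is a multiplier on args.learning_rate rather than an absolute learning rate, which is why 2e-4 effectively freezes the task layers. A quick illustration with a hypothetical base learning rate:

```python
learning_rate = 2e-5            # hypothetical BERT base learning rate
print(learning_rate * 20)       # 4e-04 -> sensible lr for the task layers
print(learning_rate * 2e-4)     # 4e-09 -> task layers barely update, dev F1 stays 0
```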

@Senwang98

@yc1999
Could you tell me how to configure the NER task so that it actually runs? I'm confused by all the arguments, and it errors out as soon as I run it.

@MoDawn

MoDawn commented Oct 17, 2022

> @yc1999 Could you tell me how to configure the NER task so that it actually runs? I'm confused by all the arguments, and it errors out as soon as I run it.

Did you ever get it running? I don't understand these arguments either.
