
Reproducing the event detection task #5

Closed · yc1999 opened this issue Dec 17, 2021 · 6 comments

@yc1999

yc1999 commented Dec 17, 2021

Hi~

  1. Have you run the event detection code on multiple GPUs? Running with 4 GPUs, I get the error below (a minimal repro of the failing reshape follows at the end of this comment):
Training:   0%|                                                                                                                                                               | 0/1159 [00:22<?, ?it/s]
Traceback (most recent call last):
  File "run_trigger_extraction.py", line 405, in <module>
    main()
  File "run_trigger_extraction.py", line 378, in main
    train(args, model, processor)
  File "run_trigger_extraction.py", line 243, in train
    pred_sub_heads, pred_sub_tails = model(data, add_label_info=add_label_info)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/project/LEAR/models/model_event.py", line 635, in forward
    fused_results = self.label_fusing_layer(
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/project/LEAR/utils/model_utils.py", line 320, in forward
    return self.get_fused_feature_with_attn(token_embs, label_embs, input_mask, label_input_mask, return_scores=return_scores)
  File "/home/yc21/project/LEAR/utils/model_utils.py", line 504, in get_fused_feature_with_attn
    scores = torch.matmul(token_feature_fc, label_feature_t).view(
RuntimeError: shape '[4, 48, 33, -1]' is invalid for input of size 160512
  2. Running on a single GPU with batch_size=2 and gradient_accumulation_step=16, the code runs, but the training loss converges to around 6 and the dev F1 stays at 0. I don't know whether this is specific to my setup or whether others hit it too. Has anyone reproduced the reported results?

Thanks, everyone!
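
For reference, the failing reshape can be reproduced in isolation using only the numbers from the traceback; this is just an illustration of the divisibility constraint that view() enforces, not the repo's code:

```python
import torch

# The target shape [4, 48, 33, -1] requires the element count to be a
# multiple of 4 * 48 * 33 = 6336, but 160512 / 6336 is not an integer,
# so view() raises exactly the error shown in the traceback.
flat = torch.empty(160512)
try:
    flat.view(4, 48, 33, -1)
except RuntimeError as e:
    print(e)  # shape '[4, 48, 33, -1]' is invalid for input of size 160512
```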

@Akeepers
Owner

  1. As I recall, I wrote the multi-GPU code when the GPUs happened to be free. It uses DP (DataParallel), so it requires padding_to_max (a sketch of the idea follows this list). Later there were many more experiments and fewer free GPUs, so I never switched it to DDP and mainly relied on gradient accumulation.

  2. That result is wrong; I suggest checking your setup. In particular, note this: "Note: The thunlp has updated the repo HMEAE recently, which causing the mismatch of data. Make sure you use the earlier version for ED task."
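
A minimal sketch of what padding_to_max means here (a hypothetical collate helper, not the repo's actual code): with nn.DataParallel each replica receives a slice of the batch along dim 0, so every sequence must already be padded to one fixed max length before the forward pass; otherwise hard-coded reshapes like the view() above can fail.

```python
import torch

def pad_to_max(input_ids_list, max_seq_length=48, pad_id=0):
    """Pad/truncate every sequence to the same fixed length (hypothetical helper)."""
    padded = []
    for ids in input_ids_list:
        ids = ids[:max_seq_length]
        pad = torch.full((max_seq_length - len(ids),), pad_id, dtype=ids.dtype)
        padded.append(torch.cat([ids, pad]))
    return torch.stack(padded)

# Three sequences of different lengths all come out with the same shape,
# so each DataParallel replica sees the sequence length it expects.
batch = pad_to_max([torch.arange(5), torch.arange(30), torch.arange(48)])
print(batch.shape)  # torch.Size([3, 48])
```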

@huanghuidmml

The gain on event detection is not large. With the old HMEAE data split, plain BIO tagging already reaches 81+, sometimes even 82.

@Akeepers
Owner

@huanghuidmml I have never gotten results that high myself.

@yc1999
Author

yc1999 commented Jan 10, 2022

The cause of the second problem was that I had set --task_layer_lr to 2e-4 when it should be 20, for the following reason:

optimizer_grouped_parameters = [
    {"params": [p for n, p in bert_parameters if not any(nd in n for nd in no_decay)],
     "weight_decay": args.weight_decay, "lr": args.learning_rate},
    {"params": [p for n, p in bert_parameters if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0, "lr": args.learning_rate},
    {"params": [p for n, p in first_start_params if not any(nd in n for nd in no_decay)],
     "weight_decay": args.weight_decay, "lr": args.learning_rate * args.task_layer_lr},
    {"params": [p for n, p in first_start_params if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0, "lr": args.learning_rate * args.task_layer_lr},
    {"params": [p for n, p in first_end_params if not any(nd in n for nd in no_decay)],
     "weight_decay": args.weight_decay, "lr": args.learning_rate * args.task_layer_lr},
    {"params": [p for n, p in first_end_params if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0, "lr": args.learning_rate * args.task_layer_lr},
]
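
In these parameter groups, task_layer_lr is a multiplier on args.learning_rate rather than an absolute learning rate, which is why 2e-4 effectively freezes the task layers. A quick illustration with a hypothetical base learning rate:

```python
learning_rate = 2e-5            # hypothetical BERT base learning rate
print(learning_rate * 20)       # 4e-04 -> sensible lr for the task layers
print(learning_rate * 2e-4)     # 4e-09 -> task layers barely update, dev F1 stays 0
```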

@Senwang98

@yc1999
Could you tell me how to configure the NER task so that it actually runs? I'm confused by all the arguments, and it errors out as soon as I run it.

@MoDawn

MoDawn commented Oct 17, 2022

> @yc1999 Could you tell me how to configure the NER task so that it actually runs? I'm confused by all the arguments, and it errors out as soon as I run it.

Did you ever get it running? I don't understand these arguments either.
