ODPS-1202005:Algo Job Failed-System Error-Wait over 30min, not enough resource. [ RequsetId: null ]. #55

poson · 2021-10-19T01:40:16Z

Instance 20211017182224640gq00ata2 Failed.
FAILED: Failed 20211017182228723gepjc292_b9f620e9_506a_48cd_ba2d_d0dad28d8e24:ODPS-1202005:Algo Job Failed-System Error-Wait over 30min, not enough resource. [ RequsetId: null ].

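This error occurs during the distributed training stage. Does it mean there are not enough resources? Resource monitoring shows that CPU and memory usage never reached 100%.
The hash bucket size is very large, on the order of tens of millions.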
-Dcluster="{"worker":{"count":8,"gpu":0,"cpu":1500,"memory":60000},"ps":{"count":8,"cpu":400,"memory":10000}}"

poson · 2021-10-19T01:43:23Z

Set the hash bucket size and the embedding dimension to smaller values to save memory.
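To see why these two settings dominate memory, here is a rough back-of-the-envelope estimate. It is only a sketch: it assumes float32 embedding values and an optimizer that keeps a fixed number of extra slot variables per weight (for example 2 for Adam), and the function name and sample sizes are made up for illustration rather than taken from the job above.

```python
def embedding_table_bytes(hash_bucket_size, embedding_dim,
                          bytes_per_value=4, optimizer_slots=0):
    """Approximate memory of one embedding table.

    Assumptions (not from the original job): float32 values (4 bytes) and
    `optimizer_slots` extra copies of the table kept by the optimizer.
    """
    weights = hash_bucket_size * embedding_dim * bytes_per_value
    return weights * (1 + optimizer_slots)


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 2 ** 30


# Hypothetical feature sizes, chosen only to show how the estimate scales.
for buckets, dim in [(10_000_000, 64), (10_000_000, 16), (1_000_000, 16)]:
    plain = embedding_table_bytes(buckets, dim)
    with_adam = embedding_table_bytes(buckets, dim, optimizer_slots=2)
    print(f"buckets={buckets:>10,} dim={dim:>3}: "
          f"{to_gib(plain):.2f} GiB weights, "
          f"{to_gib(with_adam):.2f} GiB with two optimizer slots")
```

The estimate is per feature, so several features with bucket sizes in the tens of millions add up quickly; shrinking either factor reduces the total proportionally.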

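If worker memory exceeds 40G, the job may have to wait a long time for resources, or may not be allocated any at all. The default is 30G, and it generally should not exceed 40G.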

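You can set 9 workers: 8 for computation and 1 for evaluation.
Use as few ps as possible while still fitting the model, and give them more CPU so that communication is faster.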
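As a sketch of how that advice could translate into the `-Dcluster` argument, the snippet below builds the JSON with the same keys as the original command. The concrete numbers (9 workers at 40000 memory, 4 ps at 800 cpu) are illustrative assumptions, not tested recommendations, and `build_cluster` is a hypothetical helper. The memory values are assumed to be in MB, matching the 60000 used in the question.

```python
import json


def build_cluster(worker_count, worker_cpu, worker_memory,
                  ps_count, ps_cpu, ps_memory):
    """Build the dict passed as -Dcluster (same keys as the original command)."""
    return {
        "worker": {"count": worker_count, "gpu": 0,
                   "cpu": worker_cpu, "memory": worker_memory},
        "ps": {"count": ps_count, "cpu": ps_cpu, "memory": ps_memory},
    }


# Illustrative values only: 9 workers (8 training + 1 evaluation) kept at or
# below the 40G guideline, and fewer ps nodes with more CPU each.
cluster = build_cluster(worker_count=9, worker_cpu=1500, worker_memory=40000,
                        ps_count=4, ps_cpu=800, ps_memory=10000)

cluster_arg = "-Dcluster=" + json.dumps(cluster, separators=(",", ":"))
print(cluster_arg)
```

Whether 4 ps nodes are enough depends on the model size, so the ps count and memory should be checked against the actual embedding tables.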

elif FLAGS.task_index == 1:
    os.environ['TF_CONFIG'] = json.dumps(
        {'cluster': cluster, 'task': {'type': "evaluator", 'index': 0}})

Worker 1 should be dedicated to evaluation. Of course, if there is too much test data, each evaluation will be slow.
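For context, a minimal sketch of how the role assignment around that `elif` branch might look. Only the evaluator branch comes from the snippet above; the rest is an assumption: that worker 0 acts as chief, that the remaining workers are re-indexed from 0, and that `build_tf_config` and the host names are hypothetical.

```python
import json
import os


def build_tf_config(worker_hosts, ps_hosts, job_name, task_index):
    """Assign TF_CONFIG roles: worker 0 -> chief, worker 1 -> evaluator.

    Assumed layout (not confirmed by the thread): the evaluator is left out
    of the training cluster, and workers 2..N are re-indexed starting at 0.
    """
    cluster = {
        "chief": [worker_hosts[0]],
        "worker": worker_hosts[2:],
        "ps": ps_hosts,
    }
    if job_name == "ps":
        task = {"type": "ps", "index": task_index}
    elif task_index == 0:
        task = {"type": "chief", "index": 0}
    elif task_index == 1:
        task = {"type": "evaluator", "index": 0}
    else:
        task = {"type": "worker", "index": task_index - 2}
    return {"cluster": cluster, "task": task}


# Hypothetical hosts: 9 workers and 4 ps.
workers = [f"worker{i}.example:2222" for i in range(9)]
ps = [f"ps{i}.example:2222" for i in range(4)]

tf_config = build_tf_config(workers, ps, job_name="worker", task_index=1)
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```

With 9 workers this leaves the chief plus 7 regular workers for training, which gives the 8 training processes and 1 evaluator mentioned above.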

poson closed this as completed Oct 19, 2021
