ODPS-1202005:Algo Job Failed-System Error-Wait over 30min, not enough resource. [ RequsetId: null ]. #55

Closed
poson opened this issue Oct 19, 2021 · 1 comment

poson commented Oct 19, 2021

Instance 20211017182224640gq00ata2 Failed.
FAILED: Failed 20211017182228723gepjc292_b9f620e9_506a_48cd_ba2d_d0dad28d8e24:ODPS-1202005:Algo Job Failed-System Error-Wait over 30min, not enough resource. [ RequsetId: null ].

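This error appeared during the distributed training stage. Does it mean there are not enough resources? The resource monitor shows that CPU and memory usage never reached 100%.
The hash bucket size is very large, on the order of tens of millions.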
-Dcluster="{"worker":{"count":8,"gpu":0,"cpu":1500,"memory":60000},"ps":{"count":8,"cpu":400,"memory":10000}}"

poson commented Oct 19, 2021

Set the hash bucket size and embedding dimension smaller to save memory.
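
As a rough back-of-the-envelope check on why these two settings matter, a dense float32 embedding table takes about hash_bucket_size × embedding_dim × 4 bytes, before any optimizer slots are counted. The function name and the example sizes below are illustrative assumptions, not values taken from this job.

```python
def embedding_table_bytes(hash_bucket_size, embedding_dim, bytes_per_value=4):
    """Approximate size of one dense embedding table (float32 by default)."""
    return hash_bucket_size * embedding_dim * bytes_per_value


# Hypothetical sizes: 10 million buckets, comparing two embedding dimensions.
for dim in (64, 16):
    size_gb = embedding_table_bytes(10_000_000, dim) / 1e9
    print(f"dim={dim}: {size_gb:.2f} GB per table")
```

Halving either the bucket count or the dimension halves the table size, and the saving is multiplied by the number of hashed features in the model.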

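If worker memory exceeds 40 GB, the job may wait a long time for resources or may not be allocated any at all. The default is 30 GB, and it should generally not exceed 40 GB.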

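Set the worker count to 9: 8 for computation and 1 for evaluation.
Use as few ps as possible while still fitting the model, and give each ps more CPU so that communication is faster.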
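
The advice on worker memory and worker count can be sketched as a small check on the `-Dcluster` JSON. This is only a sketch under assumptions: the `memory` field is taken to be in MB (so 60000 is roughly 60 GB), and `check_cluster`, `adjust_workers`, and the 40000 threshold constant are hypothetical helpers, not part of any PAI tooling. The ps settings are left untouched, because the right ps count depends on the model size.

```python
import json

# "Generally not more than 40 GB", assuming the memory field is in MB.
WORKER_MEMORY_SOFT_LIMIT_MB = 40000


def check_cluster(cluster_json):
    """Return a list of warnings for a -Dcluster style JSON string."""
    cluster = json.loads(cluster_json)
    warnings = []
    worker_memory = cluster.get("worker", {}).get("memory", 0)
    if worker_memory > WORKER_MEMORY_SOFT_LIMIT_MB:
        warnings.append(
            f"worker memory {worker_memory} MB exceeds "
            f"{WORKER_MEMORY_SOFT_LIMIT_MB} MB; the job may wait for resources"
        )
    return warnings


def adjust_workers(cluster_json, count=9, memory=WORKER_MEMORY_SOFT_LIMIT_MB):
    """Return a copy of the spec with the worker count and memory replaced."""
    cluster = json.loads(cluster_json)
    cluster["worker"]["count"] = count
    cluster["worker"]["memory"] = memory
    return json.dumps(cluster, separators=(",", ":"))


original = ('{"worker":{"count":8,"gpu":0,"cpu":1500,"memory":60000},'
            '"ps":{"count":8,"cpu":400,"memory":10000}}')
print(check_cluster(original))
print(adjust_workers(original))
```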

elif FLAGS.task_index == 1:
    os.environ['TF_CONFIG'] = json.dumps(
        {'cluster': cluster, 'task': {'type': "evaluator", 'index': 0}})

Worker 1 should be the one dedicated to evaluation. Of course, if there is too much test data, each evaluation will be slow.
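
To make the role assignment concrete, here is a minimal sketch of how a (job name, task index) pair might be mapped to `TF_CONFIG`. The layout is an assumption extrapolated from the fragment above: worker 0 becomes the chief, worker 1 the evaluator, and the remaining workers are re-indexed from 0. `build_tf_config` and the host names are hypothetical and are not the project's actual code.

```python
import json
import os


def build_tf_config(ps_hosts, worker_hosts, job_name, task_index):
    """Map a (job_name, task_index) pair to a TF_CONFIG dict.

    Assumed layout: worker 0 -> chief, worker 1 -> evaluator,
    workers 2 and up -> training workers re-indexed from 0.
    """
    cluster = {"chief": [worker_hosts[0]], "ps": list(ps_hosts)}
    if len(worker_hosts) > 2:
        cluster["worker"] = list(worker_hosts[2:])

    if job_name == "ps":
        task = {"type": "ps", "index": task_index}
    elif task_index == 0:
        task = {"type": "chief", "index": 0}
    elif task_index == 1:
        # The evaluator runs on its own and is not listed in the cluster.
        task = {"type": "evaluator", "index": 0}
    else:
        task = {"type": "worker", "index": task_index - 2}
    return {"cluster": cluster, "task": task}


# Hypothetical hosts: 9 workers give 1 chief, 1 evaluator, 7 training workers.
workers = [f"worker{i}:2222" for i in range(9)]
config = build_tf_config(["ps0:2222"], workers, "worker", 1)
os.environ["TF_CONFIG"] = json.dumps(config)
print(config["task"])
```

With 9 workers requested, the chief plus 7 training workers make up the 8 computing tasks, and worker 1 only evaluates.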

poson closed this as completed Oct 19, 2021