Instance 20211017182224640gq00ata2 Failed.
FAILED: Failed 20211017182228723gepjc292_b9f620e9_506a_48cd_ba2d_d0dad28d8e24:ODPS-1202005:Algo Job Failed-System Error-Wait over 30min, not enough resource. [ RequsetId: null ].
This error appears during the distributed training stage. Does it mean there are not enough resources? When I check resource monitoring, CPU and memory usage are not at 100%.
The hash bucket size is very large, on the order of tens of millions.
-Dcluster="{"worker":{"count":8,"gpu":0,"cpu":1500,"memory":60000},"ps":{"count":8,"cpu":400,"memory":10000}}"
You can set the hash bucket size and the embedding dimension smaller to save memory.
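As a rough illustration of why these two settings dominate memory use: a dense embedding table holds `hash_bucket_size * embedding_dim` values, and float32 values take 4 bytes each. The sketch below is only a back-of-the-envelope estimate. The function name, the embedding dimension of 16, and the `optimizer_slots` parameter are assumptions made for illustration; they do not come from this thread or from any library.

```python
def embedding_table_bytes(hash_bucket_size, embedding_dim,
                          bytes_per_value=4, optimizer_slots=0):
    """Estimate the memory of one dense embedding table, in bytes.

    optimizer_slots is a hypothetical knob: the number of extra
    per-parameter copies the optimizer keeps (for example, Adam keeps 2).
    """
    copies = 1 + optimizer_slots
    return hash_bucket_size * embedding_dim * bytes_per_value * copies


# Illustrative values only: 10 million buckets, embedding dimension 16.
weights_only = embedding_table_bytes(10_000_000, 16)
with_two_slots = embedding_table_bytes(10_000_000, 16, optimizer_slots=2)
print(weights_only / 1e9, "GB for weights")            # 0.64
print(with_two_slots / 1e9, "GB with 2 optimizer slots")  # 1.92
```

The estimate is per feature, so several features with bucket sizes of this scale add up quickly across the ps nodes.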
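With worker memory above 40G, the job may have to wait a long time, or it may not be allocated resources at all. The default is 30G, and it generally should not exceed 40G.
You can set 9 workers: 8 for training and 1 for evaluation. For ps, the fewer the better, as long as they can still hold the model; give them more CPU so communication is faster.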
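The advice above could be turned into a `-Dcluster` value along the following lines. This is a sketch, not a recommended configuration: the helper name is hypothetical, the memory field is assumed to be in MB (inferred from 60000 being read as more than 40G in this thread), and the ps count and ps CPU values in the usage example are placeholders that would need tuning to the model size.

```python
import json

# From the reply above: worker memory should generally not exceed 40G.
# Assumption: the "memory" field is expressed in MB.
MAX_WORKER_MEMORY = 40000


def build_cluster_arg(worker_count, worker_cpu, worker_memory,
                      ps_count, ps_cpu, ps_memory):
    """Build the JSON string passed as -Dcluster (hypothetical helper)."""
    if worker_memory > MAX_WORKER_MEMORY:
        raise ValueError(
            "worker memory %d exceeds %d; the job may wait for resources"
            % (worker_memory, MAX_WORKER_MEMORY))
    cluster = {
        "worker": {"count": worker_count, "gpu": 0,
                   "cpu": worker_cpu, "memory": worker_memory},
        "ps": {"count": ps_count, "cpu": ps_cpu, "memory": ps_memory},
    }
    return json.dumps(cluster, separators=(",", ":"))


# 9 workers (8 training + 1 evaluation); ps values are placeholders.
print(build_cluster_arg(9, 1500, 40000, 4, 800, 10000))
```

Passing the original worker memory of 60000 to this helper raises a `ValueError`, which flags the setting that the reply identifies as the likely cause of the wait.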
    elif FLAGS.task_index == 1:
        os.environ['TF_CONFIG'] = json.dumps(
            {'cluster': cluster, 'task': {'type': "evaluator", 'index': 0}})
Worker 1 should be dedicated to evaluation. That said, if there is a lot of test data, each evaluation will be slow.
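To make the role assignment concrete, here is a minimal sketch of how each worker's `TF_CONFIG` might be derived from its task index. Only the `task_index == 1` branch is taken from the snippet above. Treating index 0 as the chief and shifting the remaining workers down by two are assumptions, and the host names are made up.

```python
import json


def make_tf_config(cluster, task_index):
    """Return a TF_CONFIG JSON string for the worker with this index."""
    if task_index == 0:
        # Assumption: the first worker acts as chief.
        task = {"type": "chief", "index": 0}
    elif task_index == 1:
        # From the thread: worker 1 is dedicated to evaluation.
        task = {"type": "evaluator", "index": 0}
    else:
        # Assumption: remaining workers are renumbered from 0.
        task = {"type": "worker", "index": task_index - 2}
    return json.dumps({"cluster": cluster, "task": task})


# Hypothetical cluster for 9 workers: 1 chief, 1 evaluator, 7 plain workers.
cluster = {
    "chief": ["host0:2222"],
    "worker": ["host%d:2222" % i for i in range(2, 9)],
    "ps": ["ps0:2222"],
}
print(make_tf_config(cluster, 1))
```

In a real job the resulting string would be assigned to `os.environ['TF_CONFIG']` before the estimator is created, as in the snippet above.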