Got rpc error if setting up with non-root user #63

Closed
Chenfengldw opened this issue Dec 6, 2019 · 9 comments

@Chenfengldw

I changed the user in the parties.conf file from root to ubuntu, which is a sudo user on my cluster.
I manually created the /data/ directory and set its permissions to 777 with chmod, then ran the deploy script as usual.
The output looked fine and all the containers came up, but RPC calls do not seem to work correctly.

(venv) [root@383b62d82f80 toy_example]# python run_toy_example.py 10000 9999 1
stdout:{
"retcode": 100,
"retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "io exception"\n\tdebug_error_string = "{"created":"@1575627671.035273330","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"io exception","grpc_status":14}"\n>"
}

Traceback (most recent call last):
  File "run_toy_example.py", line 196, in <module>
    exec_toy_example(runtime_config)
  File "run_toy_example.py", line 161, in exec_toy_example
    jobid = exec_task(dsl_path, runtime_config)
  File "run_toy_example.py", line 91, in exec_task
    "failed to exec task, status:{}, stderr is {} stdout:{}".format(status, stderr, stdout))
ValueError: failed to exec task, status:100, stderr is None stdout:{'retcode': 100, 'retmsg': 'rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "io exception"\n\tdebug_error_string = "{"created":"@1575627671.035273330","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"io exception","grpc_status":14}"\n>'}

@jiahaoc1993
Contributor

@Chenfengldw Please check if all services are running properly.

@Chenfengldw
Author

@jiahaoc1993
All the containers are running:

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
afbe6237b517 federatedai/fateboard:1.1.1-release "/bin/sh -c 'cd /dat…" 28 hours ago Up 28 hours 0.0.0.0:8080->8080/tcp confs-9999_fateboard_1
491b84eb4b90 federatedai/python:1.1.1-release "/bin/sh -c 'sleep 5…" 28 hours ago Up 28 hours 0.0.0.0:9360->9360/tcp, 0.0.0.0:9380->9380/tcp confs-9999_python_1
b52c8509ec49 federatedai/roll:1.1.1-release "/bin/sh -c 'cd roll…" 28 hours ago Up 28 hours 8011/tcp confs-9999_roll_1
4733a790f681 federatedai/meta-service:1.1.1-release "/bin/sh -c 'java -c…" 28 hours ago Up 28 hours 8590/tcp confs-9999_meta-service_1
ddfc43b65e5b federatedai/egg:1.1.1-release "/bin/sh -c 'cd /dat…" 28 hours ago Up 28 hours 7778/tcp, 7888/tcp, 50000-60000/tcp confs-9999_egg_1
f8b8d98f3906 federatedai/federation:1.1.1-release "/bin/sh -c 'cd /dat…" 28 hours ago Up 28 hours 9394/tcp confs-9999_federation_1
529c67f6f78b redis:5 "docker-entrypoint.s…" 28 hours ago Up 28 hours 6379/tcp confs-9999_redis_1
76e270e2a190 mysql:8 "docker-entrypoint.s…" 28 hours ago Up 28 hours 3306/tcp, 33060/tcp confs-9999_mysql_1
6289a791b632 federatedai/proxy:1.1.1-release "/bin/sh -c 'cd /dat…" 28 hours ago Up 28 hours 0.0.0.0:9370->9370/tcp

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a13f40127216 federatedai/fateboard:1.1.1-release "/bin/sh -c 'cd /dat…" 29 hours ago Up 29 hours 0.0.0.0:8080->8080/tcp confs-10000_fateboard_1
383b62d82f80 federatedai/python:1.1.1-release "/bin/sh -c 'sleep 5…" 29 hours ago Up 29 hours 0.0.0.0:9360->9360/tcp, 0.0.0.0:9380->9380/tcp confs-10000_python_1
e1934e7e3eb3 federatedai/roll:1.1.1-release "/bin/sh -c 'cd roll…" 29 hours ago Up 29 hours 8011/tcp confs-10000_roll_1
55d60287a711 federatedai/meta-service:1.1.1-release "/bin/sh -c 'java -c…" 29 hours ago Up 29 hours 8590/tcp confs-10000_meta-service_1
101d6802d206 federatedai/egg:1.1.1-release "/bin/sh -c 'cd /dat…" 29 hours ago Up 29 hours 7778/tcp, 7888/tcp, 50000-60000/tcp confs-10000_egg_1
96a355a16f38 federatedai/proxy:1.1.1-release "/bin/sh -c 'cd /dat…" 29 hours ago Up 29 hours 0.0.0.0:9370->9370/tcp confs-10000_proxy_1
33dd514181a0 redis:5 "docker-entrypoint.s…" 29 hours ago Up 29 hours 6379/tcp confs-10000_redis_1
1342e73a515a federatedai/federation:1.1.1-release "/bin/sh -c 'cd /dat…" 29 hours ago Up 29 hours 9394/tcp confs-10000_federation_1
7a0a2058e685 mysql:8 "docker-entrypoint.s…" 29 hours ago Up 29 hours 3306/tcp, 33060/tcp

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4a83e738317f federatedai/proxy:1.1.1-release "/bin/sh -c 'cd /dat…" 41 seconds ago Up 39 seconds 0.0.0.0:9371->9370/tcp confs-exchange_exchange_1

@jiahaoc1993
Contributor

jiahaoc1993 commented Dec 9, 2019

You should check the storage service as well.

@Chenfengldw
Author

I only checked the containers listed in the doc. Could you please give some more information about the "storage service"?

@jiahaoc1993
Contributor

Hi, please follow this link to verify your storage service: #53.

@Hyberlion

I have the same problem. Has yours been fixed yet?

@jiahaoc1993
Contributor

Please make sure that "$dir" exists on the target host and is owned by "$user".
$dir and $user are defined in parties.conf and kube.cfg.
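
The check described above can be sketched in Python. This is only an illustration: the function name `dir_problems` is hypothetical, and the `/data` and `ubuntu` values are taken from the original report, so substitute whatever your own parties.conf defines.

```python
import os
import pwd


def dir_problems(path: str, uid: int) -> list:
    """Return human-readable problems; an empty list means the check passed."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    owner = os.stat(path).st_uid
    if owner != uid:
        return [f"{path} is owned by uid {owner}, expected uid {uid}"]
    return []


if __name__ == "__main__":
    # Assumed values from the report above; replace with your $dir and $user.
    target_dir, user = "/data", "ubuntu"
    try:
        uid = pwd.getpwnam(user).pw_uid  # look up the numeric uid for $user
    except KeyError:
        print(f"user {user!r} does not exist on this host")
    else:
        problems = dir_problems(target_dir, uid)
        print("\n".join(problems) if problems else f"{target_dir} looks fine")
```

Run it on each target host. Note that it checks ownership only, not permission bits.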

@Chenfengldw
Author

I fixed this issue by using a CPU that supports the AVX2 instruction set.
Please run cat /proc/cpuinfo to make sure that your CPU supports avx2.
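
For anyone who wants to script that check, here is a minimal sketch that parses the flags lines of /proc/cpuinfo. The helper names `cpu_flags` and `has_avx2` are hypothetical, and it assumes a Linux host.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels label this line "flags"; some ARM kernels use "Features".
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def has_avx2(cpuinfo_text: str) -> bool:
    """True if the avx2 flag appears for at least one listed processor."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print("avx2 supported:", has_avx2(path.read_text()))
    else:
        print("/proc/cpuinfo not found (not a Linux host?)")
```

The flags are split into whole words before matching, so a CPU that only reports avx is not mistaken for one with avx2.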

@jiahaoc1993
Contributor

@Chenfengldw Good to hear that. Closing now.
