Worker keeps trying to connect to standby RM (ResourceManager) and hangs #140
Comments
Please paste all logs and thread stacks.
2017-07-13 10:23:54,482 INFO [pool-8-thread-3] org.ehcache.sizeof.impl.AgentLoader: Agent successfully loaded and available!
It looks like your HDFS cluster is in an unhealthy state: the log shows that the active NameNode has failed over or died, and the other NameNode is unreachable.
So the worker is not hanging on a connection to the standby RM; you have presumably configured the NameNode and the ResourceManager on the same node. Please check your cluster, including HDFS.
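One way to run that check is sketched below, assuming the `hdfs` and `yarn` command-line tools are on the PATH of a cluster node. The function name `check_ha_states` is made up for illustration, and treating `c3prc-hadoop` as the HDFS nameservice ID is an assumption based on the `hdfs://` URIs in the submit script; the RM IDs come from `yarn.resourcemanager.ha.rm-ids`.

```shell
#!/bin/bash
# Print the HA state of every NameNode in a nameservice and of the given RMs.
# Usage: check_ha_states <nameservice> <rm-id>...
# check_ha_states is a hypothetical helper name, not part of Hadoop or Angel.
check_ha_states() {
  local ns="$1"
  shift
  if ! command -v hdfs >/dev/null 2>&1 || ! command -v yarn >/dev/null 2>&1; then
    echo "hdfs and yarn CLIs are required on PATH" >&2
    return 1
  fi
  local nn_ids nn rm
  # NameNode IDs are read from the client configuration rather than guessed.
  nn_ids="$(hdfs getconf -confKey "dfs.ha.namenodes.${ns}")" || return 1
  for nn in $(echo "${nn_ids}" | tr ',' ' '); do
    echo "namenode ${nn}: $(hdfs haadmin -getServiceState "${nn}" 2>&1)"
  done
  for rm in "$@"; do
    echo "resourcemanager ${rm}: $(yarn rmadmin -getServiceState "${rm}" 2>&1)"
  done
}

# Example call (nameservice is an assumption, RM IDs are from the posted config):
# check_ha_states c3prc-hadoop rm0 rm1
```

If neither NameNode reports `active`, or one of them cannot be reached at all, that would be consistent with the diagnosis above.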
Worker's thread stack:
threadid: 104375812 threadname: IPC Client (969972672) connection to c3-hadoop-prc-ct05.bj/10.108.84.32:11200 from work threadstate: TIMED_WAITING
threadid: 104364428 threadname: pool-3-thread-4432 threadstate: TIMED_WAITING
threadid: 104350804 threadname: pool-3-thread-4431 threadstate: TIMED_WAITING
threadid: 104346404 threadname: pool-3-thread-4430 threadstate: TIMED_WAITING
threadid: 104320940 threadname: pool-3-thread-4429 threadstate: TIMED_WAITING
threadid: 104318659 threadname: pool-3-thread-4428 threadstate: TIMED_WAITING
threadid: 104258557 threadname: pool-3-thread-4427 threadstate: TIMED_WAITING
threadid: 104206850 threadname: pool-3-thread-4426 threadstate: TIMED_WAITING
threadid: 104197710 threadname: pool-3-thread-4425 threadstate: TIMED_WAITING
threadid: 104123041 threadname: pool-3-thread-4424 threadstate: TIMED_WAITING
threadid: 104116415 threadname: pool-3-thread-4423 threadstate: TIMED_WAITING
threadid: 104015130 threadname: pool-3-thread-4422 threadstate: TIMED_WAITING
threadid: 104004769 threadname: pool-3-thread-4421 threadstate: TIMED_WAITING
threadid: 103653083 threadname: pool-3-thread-4411 threadstate: TIMED_WAITING
threadid: 103470486 threadname: pool-3-thread-4406 threadstate: TIMED_WAITING
threadid: 103385858 threadname: pool-3-thread-4404 threadstate: TIMED_WAITING
threadid: 103381515 threadname: pool-3-thread-4403 threadstate: TIMED_WAITING
threadid: 103252517 threadname: pool-3-thread-4397 threadstate: TIMED_WAITING
threadid: 48 threadname: nioEventLoopGroup-2-21 threadstate: RUNNABLE
threadid: 49 threadname: nioEventLoopGroup-2-22 threadstate: RUNNABLE
threadid: 51 threadname: nioEventLoopGroup-2-24 threadstate: RUNNABLE
threadid: 50 threadname: nioEventLoopGroup-2-23 threadstate: RUNNABLE
threadid: 1959 threadname: Thread-1823 threadstate: TIMED_WAITING
threadid: 1958 threadname: Thread-1825 threadstate: TIMED_WAITING
threadid: 1957 threadname: Thread-1824 threadstate: TIMED_WAITING
threadid: 1956 threadname: LeaseRenewer:work@c3prc-hadoop threadstate: TIMED_WAITING
threadid: 1955 threadname: Thread-1821 threadstate: TIMED_WAITING
threadid: 1773 threadname: Attach Listener threadstate: RUNNABLE
threadid: 1212 threadname: IPC Parameter Sending Thread #1 threadstate: TIMED_WAITING
threadid: 600 threadname: DestroyJavaVM threadstate: RUNNABLE
threadid: 599 threadname: pool-8-thread-4 threadstate: TIMED_WAITING
threadid: 597 threadname: pool-8-thread-3 threadstate: TIMED_WAITING
threadid: 580 threadname: pool-8-thread-2 threadstate: TIMED_WAITING
threadid: 573 threadname: pool-8-thread-1 threadstate: TIMED_WAITING
threadid: 151 threadname: Worker Heartbeat threadstate: TIMED_WAITING
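For a quick overview of a dump like this, the thread states can be counted with standard tools. The following is only a triage sketch: `summarize_states` is a made-up helper name, and the three sample entries are copied from the dump above.

```shell
#!/bin/bash
# Count threads per state in a worker thread dump read from stdin.
# Works whether the entries are on one line or one per line, because it
# only looks for "threadstate: <STATE>" tokens.
summarize_states() {
  grep -oE 'threadstate: [A-Z_]+' \
    | awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' \
    | sort
}

# Example with three entries taken from the dump.
summarize_states <<'EOF'
threadid: 48 threadname: nioEventLoopGroup-2-21 threadstate: RUNNABLE
threadid: 1958 threadname: Thread-1825 threadstate: TIMED_WAITING
threadid: 151 threadname: Worker Heartbeat threadstate: TIMED_WAITING
EOF
```

A summary dominated by TIMED_WAITING only says that threads are waiting, not on what; the full stack of each waiting thread is still needed to see which call it is blocked in.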
We ran the LRRunner demo on the a9a.train data. The bash script is as follows.
#!/bin/bash
./angel-submit \
  --cluster c3prc-hadoop \
  --action.type train \
  --angel.app.submit.class "com.tencent.angel.ml.classification.lr.LRRunner" \
  --angel.train.data.path "hdfs://c3prc-hadoop/user/h_miui_ad/develop/mengqingdi/dmp_example/sim_train_data/a9a.train" \
  --angel.log.path "hdfs://c3prc-hadoop/user/h_miui_ad/develop/mengqingdi/dmp_example/sim/log" \
  --angel.save.model.path "hdfs://c3prc-hadoop/user/h_miui_ad/develop/mengqingdi/dmp_example/sim/model" \
  --queue root.production.miui_group.miui_ad.queue_algo \
  --ml.epoch.num 10 \
  --ml.batch.num 10 \
  --ml.feature.num 1024 \
  --ml.validate.ratio 0.1 \
  --ml.data.type libsvm \
  --ml.learn.rate 1 \
  --ml.learn.decay 0.1 \
  --ml.reg.l2 0 \
  --angel.workergroup.number 50 \
  --angel.worker.memory.gb 10 \
  --angel.worker.task.number 4 \
  --angel.ps.number 20 \
  --angel.ps.memory.gb 5 \
  --angel.job.name LR_SAMPLE
At the beginning, the iterations go well. But after a few iterations (the exact number varies), the worker hangs and keeps trying to connect to the standby RM, and the whole computation is stuck.
The head of the worker thread stack looks like this:
threadid: 121196 threadname: IPC Client (868889379) connection to c3-hadoop-prc-ct05.bj/10.108.84.32:11200 from work threadstate: TIMED_WAITING
java.lang.Object.wait(Native Method)
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:928)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:973)
PS: our YARN RM configuration is:
yarn.resourcemanager.ha.enabled true
yarn.resourcemanager.ha.rm-ids rm0,rm1
yarn.resourcemanager.resource-tracker.address c3-hadoop-prc-ct04.bj:22303
yarn.resourcemanager.resource-tracker.address.rm0 c3-hadoop-prc-ct04.bj:22303
yarn.resourcemanager.resource-tracker.address.rm1 c3-hadoop-prc-ct05.bj:22303
yarn.resourcemanager.resource-tracker.client.thread-count 50
yarn.resourcemanager.scheduler.address c3-hadoop-prc-ct04.bj:22302
yarn.resourcemanager.scheduler.address.rm0 c3-hadoop-prc-ct04.bj:22302
yarn.resourcemanager.scheduler.address.rm1 c3-hadoop-prc-ct05.bj:22302
yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
yarn.resourcemanager.scheduler.client.thread-count 50
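To see whether the endpoint in the stack head (`c3-hadoop-prc-ct05.bj`, port 11200) is really one of the RM addresses, it can be compared against the configuration. The sketch below assumes the configuration is available as one key/value pair per line; `match_endpoint` is a made-up helper name, and only the four per-RM addresses posted above are used as input.

```shell
#!/bin/bash
# List config entries whose value is on the given host, then report whether
# the exact host:port endpoint is among them.
# Input on stdin: "key value" pairs, one per line.
match_endpoint() {
  local host="$1" port="$2"
  awk -v host="$host" -v port="$port" '
    index($2, host ":") == 1 {
      print "host match:", $1, $2
      if ($2 == host ":" port) exact = 1
    }
    END { print (exact ? "endpoint found" : "endpoint not in listed addresses") }
  '
}

# Endpoint from the stack head, checked against the posted per-RM addresses.
match_endpoint c3-hadoop-prc-ct05.bj 11200 <<'EOF'
yarn.resourcemanager.resource-tracker.address.rm0 c3-hadoop-prc-ct04.bj:22303
yarn.resourcemanager.resource-tracker.address.rm1 c3-hadoop-prc-ct05.bj:22303
yarn.resourcemanager.scheduler.address.rm0 c3-hadoop-prc-ct04.bj:22302
yarn.resourcemanager.scheduler.address.rm1 c3-hadoop-prc-ct05.bj:22302
EOF
```

With these inputs the host matches the rm1 entries, but port 11200 is not one of the listed RM ports (22302, 22303). Only part of the configuration is posted here, so this does not prove which service listens on 11200; it only shows that the connection is not to one of the RM addresses shown.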