
Dynamic DAG failed after changing parallelism for many times #1777

Closed
manuzhang opened this issue Dec 25, 2015 · 6 comments

@manuzhang
Member

How to reproduce:

Submit SOL through the Web UI and repeatedly change the parallelism of SOLStreamProducer from 1 to 2, 3, 4, 5. After 4 or 5 changes, SOL fails on its own.

executor log:

[INFO] [12/25/2015 08:57:12.019] [Slf4jLogger] Slf4jLogger started
[INFO] [12/25/2015 08:57:12.078] [Remoting] Starting remoting
[INFO] [12/25/2015 08:57:12.257] [Remoting] Remoting started; listening on addresses :[akka.tcp://app1system9@127.0.0.1:53216]
[INFO] [12/25/2015 08:57:12.297] [Metrics$] Metrics is enabled...,  true
[INFO] [12/25/2015 08:57:12.302] [Daemon@app1system9] RegisterActorSystem to akka.tcp://app1-executor-1@127.0.0.1:59046/user/daemon/appdaemon1/$b/$a/$o, current user: manu
[WARN] [12/25/2015 08:57:12.417] [ReliableDeliverySupervisor] Association with remote system [akka.tcp://app1-executor-1@127.0.0.1:59046] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://app1-executor-1@127.0.0.1:59046]] Caused by: [Connection refused: /127.0.0.1:59046]
[ERROR] [12/25/2015 08:57:37.317] [Daemon@app1system9] RegisterActorSystemFailed
java.util.concurrent.TimeoutException
        at io.gearpump.util.ActorSystemBooter$Daemon.<init>(ActorSystemBooter.scala:108)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at akka.util.Reflect$.instantiate(Reflect.scala:66)
        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
        at akka.actor.Props.newActor(Props.scala:252)
        at akka.actor.ActorCell.newActor(ActorCell.scala:552)
        at akka.actor.ActorCell.create(ActorCell.scala:578)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[INFO] [12/25/2015 08:57:37.320] [Daemon@app1system9] ActorSystem app1system9 is shutting down...
[INFO] [12/25/2015 08:57:37.332] [RemoteActorRefProvider$RemotingTerminator] Shutting down remote daemon.
[INFO] [12/25/2015 08:57:37.334] [RemoteActorRefProvider$RemotingTerminator] Remote daemon shut down; proceeding with flushing remote transports.
[INFO] [12/25/2015 08:57:37.391] [RemoteActorRefProvider$RemotingTerminator] Remoting shut down.
[INFO] [12/25/2015 08:57:37.407] [ActorSystemBooter$] Maybe we have received a SIGINT signal from parent process, start to cleanup resources...

application master log:

[INFO] [12/25/2015 08:56:56.277] [ExecutorSystemLauncher] Launching Executor ...appId: 1, executorSystemId: 8, slots: 2 on worker WorkerInfo(0,Actor[akka.tcp://master@127.0.0.1:3000/user/Worker1#-1056887906])
[INFO] [12/25/2015 08:56:56.278] [ExecutorSystemLauncher] Launching Executor ...appId: 1, executorSystemId: 9, slots: 3 on worker WorkerInfo(1,Actor[akka.tcp://master@127.0.0.1:3000/user/Worker0#-1162316881])
[ERROR] [12/25/2015 08:57:11.278] [ExecutorSystemLauncher] The Executor ActorSystem 8 has not been started in time
[ERROR] [12/25/2015 08:57:11.279] [ExecutorSystemScheduler@app1] Failed to launch executor system for Actor[akka://app1-executor-1/user/daemon/appdaemon1/$c/appmaster/executors#-1755639163] due to timeout
[ERROR] [12/25/2015 08:57:11.279] [AppMaster@app1] Failed to allocate resource in time, shutdown application 1
[ERROR] [12/25/2015 08:57:11.285] [ExecutorSystemLauncher] The Executor ActorSystem 9 has not been started in time
[ERROR] [12/25/2015 08:57:11.285] [ExecutorSystemScheduler@app1] Failed to launch executor system for Actor[akka://app1-executor-1/user/daemon/appdaemon1/$c/appmaster/executors#-1755639163] due to timeout
[INFO] [12/25/2015 08:57:11.505] [ActorSystemBooter$] Maybe we have received a SIGINT signal from parent process, start to cleanup resources....
@huafengw
Contributor

Not reproducible on my machine; postponing this issue.

@clockfly
Member

@manuzhang

Can you reproduce this? It is possibly because of the timeout setting for launching a JVM. Under certain circumstances, such as very limited memory and very high CPU usage, it may take a long time to start a new JVM, up to tens of seconds.
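The failure mode described here, a fixed registration timeout that a slow-starting executor JVM overruns under resource pressure, can be sketched as follows. This is a minimal illustration, not Gearpump's actual code; the function names and timings are hypothetical stand-ins for the executor launch and the RegisterActorSystem handshake:

```python
import concurrent.futures
import time

def start_executor(startup_seconds):
    """Stand-in for launching an executor JVM; startup slows under load."""
    time.sleep(startup_seconds)
    return "registered"

def launch_with_timeout(startup_seconds, timeout_seconds):
    """Wait for the executor to register, failing like RegisterActorSystemFailed."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(start_executor, startup_seconds)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            # Corresponds to the TimeoutException / RegisterActorSystemFailed in the logs.
            return "RegisterActorSystemFailed"

# A fast-starting executor registers in time; a slow one hits the timeout.
print(launch_with_timeout(0.01, 0.5))
print(launch_with_timeout(0.5, 0.05))
```

Raising the timeout (the workaround proposed below in this thread) corresponds to increasing `timeout_seconds` so that a JVM slowed by memory or CPU pressure still registers in time.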

@clockfly
Member

Can you also attach the AppMaster log next time, so that we can compare the timestamps of both?

@manuzhang
Member Author

The second log is the AppMaster log.

@manuzhang
Member Author

@clockfly Yes, I can reproduce this in my environment. I agree with your guess, because a bunch of dead executors are still taking up memory after several rounds of DAG changes.

@clockfly
Member

To work around this, I will change the JVM start timeout setting to a larger value.

@clockfly clockfly added this to the 0.7.5 milestone Jan 19, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 27, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 27, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 29, 2016
manuzhang added a commit that referenced this issue Jan 29, 2016
fix #1777, Dynamic DAG failed after changing parallelism for many times