
Dynamic DAG failed after changing parallelism for many times #1777

Closed
manuzhang opened this issue Dec 25, 2015 · 6 comments

@manuzhang
Member

How to reproduce:

Submit SOL through the Web UI and repeatedly change the parallelism of SOLStreamProducer from 1 to 2, 3, 4, 5. After 4 or 5 changes, SOL fails on its own.

executor log:

[INFO] [12/25/2015 08:57:12.019] [Slf4jLogger] Slf4jLogger started
[INFO] [12/25/2015 08:57:12.078] [Remoting] Starting remoting
[INFO] [12/25/2015 08:57:12.257] [Remoting] Remoting started; listening on addresses :[akka.tcp://app1system9@127.0.0.1:53216]
[INFO] [12/25/2015 08:57:12.297] [Metrics$] Metrics is enabled...,  true
[INFO] [12/25/2015 08:57:12.302] [Daemon@app1system9] RegisterActorSystem to akka.tcp://app1-executor-1@127.0.0.1:59046/user/daemon/appdaemon1/$b/$a/$o, current user: manu
[WARN] [12/25/2015 08:57:12.417] [ReliableDeliverySupervisor] Association with remote system [akka.tcp://app1-executor-1@127.0.0.1:59046] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://app1-executor-1@127.0.0.1:59046]] Caused by: [Connection refused: /127.0.0.1:59046]
[ERROR] [12/25/2015 08:57:37.317] [Daemon@app1system9] RegisterActorSystemFailed
java.util.concurrent.TimeoutException
        at io.gearpump.util.ActorSystemBooter$Daemon.<init>(ActorSystemBooter.scala:108)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at akka.util.Reflect$.instantiate(Reflect.scala:66)
        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
        at akka.actor.Props.newActor(Props.scala:252)
        at akka.actor.ActorCell.newActor(ActorCell.scala:552)
        at akka.actor.ActorCell.create(ActorCell.scala:578)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[INFO] [12/25/2015 08:57:37.320] [Daemon@app1system9] ActorSystem app1system9 is shutting down...
[INFO] [12/25/2015 08:57:37.332] [RemoteActorRefProvider$RemotingTerminator] Shutting down remote daemon.
[INFO] [12/25/2015 08:57:37.334] [RemoteActorRefProvider$RemotingTerminator] Remote daemon shut down; proceeding with flushing remote transports.
[INFO] [12/25/2015 08:57:37.391] [RemoteActorRefProvider$RemotingTerminator] Remoting shut down.
[INFO] [12/25/2015 08:57:37.407] [ActorSystemBooter$] Maybe we have received a SIGINT signal from parent process, start to cleanup resources...

application master log:

[INFO] [12/25/2015 08:56:56.277] [ExecutorSystemLauncher] Launching Executor ...appId: 1, executorSystemId: 8, slots: 2 on worker WorkerInfo(0,Actor[akka.tcp://master@127.0.0.1:3000/user/Worker1#-1056887906])
[INFO] [12/25/2015 08:56:56.278] [ExecutorSystemLauncher] Launching Executor ...appId: 1, executorSystemId: 9, slots: 3 on worker WorkerInfo(1,Actor[akka.tcp://master@127.0.0.1:3000/user/Worker0#-1162316881])
[ERROR] [12/25/2015 08:57:11.278] [ExecutorSystemLauncher] The Executor ActorSystem 8 has not been started in time
[ERROR] [12/25/2015 08:57:11.279] [ExecutorSystemScheduler@app1] Failed to launch executor system for Actor[akka://app1-executor-1/user/daemon/appdaemon1/$c/appmaster/executors#-1755639163] due to timeout
[ERROR] [12/25/2015 08:57:11.279] [AppMaster@app1] Failed to allocate resource in time, shutdown application 1
[ERROR] [12/25/2015 08:57:11.285] [ExecutorSystemLauncher] The Executor ActorSystem 9 has not been started in time
[ERROR] [12/25/2015 08:57:11.285] [ExecutorSystemScheduler@app1] Failed to launch executor system for Actor[akka://app1-executor-1/user/daemon/appdaemon1/$c/appmaster/executors#-1755639163] due to timeout
[INFO] [12/25/2015 08:57:11.505] [ActorSystemBooter$] Maybe we have received a SIGINT signal from parent process, start to cleanup resources....
@huafengw
Contributor

Not reproducible on my machine; postponing this issue.

@clockfly
Member

@manuzhang

Can you reproduce this? It is possibly because of the timeout setting for launching a JVM. Under certain circumstances, such as very limited memory and very high CPU usage, it may take a long time to start a new JVM, up to tens of seconds.
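The failure mode described here, a fixed registration timeout that a slow-starting executor JVM overruns under resource pressure, can be sketched as follows. This is a minimal illustration, not Gearpump's actual code; the function names and timings are hypothetical stand-ins for the executor launch and the RegisterActorSystem handshake:

```python
import concurrent.futures
import time

def start_executor(startup_seconds):
    """Stand-in for launching an executor JVM; startup slows under load."""
    time.sleep(startup_seconds)
    return "registered"

def launch_with_timeout(startup_seconds, timeout_seconds):
    """Wait for the executor to register, failing like RegisterActorSystemFailed."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(start_executor, startup_seconds)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            # Corresponds to the TimeoutException / RegisterActorSystemFailed in the logs.
            return "RegisterActorSystemFailed"

# A fast-starting executor registers in time; a slow one hits the timeout.
print(launch_with_timeout(0.01, 0.5))
print(launch_with_timeout(0.5, 0.05))
```

Raising the timeout (the workaround proposed below in this thread) corresponds to increasing `timeout_seconds` so that a JVM slowed by memory or CPU pressure still registers in time.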

@clockfly
Member

Can you also attach the AppMaster log next time, so that we can compare the timestamps of both?

@manuzhang
Member Author

The second log is the AppMaster log.

@manuzhang
Member Author

@clockfly Yes, I can reproduce this in my environment. I agree with your guess, because a bunch of dead executors are still taking up memory after several rounds of DAG changes.

@clockfly
Member

To work around this, I will change the JVM start timeout setting to a larger value.

@clockfly clockfly added this to the 0.7.5 milestone Jan 19, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 27, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 27, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 28, 2016
clockfly added a commit to clockfly/gearpump that referenced this issue Jan 29, 2016
manuzhang added a commit that referenced this issue Jan 29, 2016
fix #1777, Dynamic DAG failed after changing parallelism for many times