Failed to submit Topology on Yarn Mode #1349

Open
HosiYuki opened this Issue Sep 7, 2016 · 42 comments

HosiYuki commented Sep 7, 2016

I use YARN as the scheduler, following the instructions at http://twitter.github.io/heron/docs/operators/deployment/schedulers/yarn/

But I failed to submit a topology. The error is as follows:

cheng@node18-10:~/.heron/examples$ heron submit yarn heron-examples.jar com.twitter.heron.examples.AckingTopology AckingTopology
INFO: Launching topology 'AckingTopology'
[2016-09-07 16:04:47 +0800] com.twitter.heron.scheduler.SubmitterMain SEVERE:  Failed to instantiate instances 
java.lang.ClassNotFoundException: com.twitter.heron.scheduler.yarn.YarnLauncher
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at com.twitter.heron.spi.utils.ReflectionUtils.newInstance(ReflectionUtils.java:31)
        at com.twitter.heron.spi.utils.ReflectionUtils.newInstance(ReflectionUtils.java:25)
        at com.twitter.heron.scheduler.SubmitterMain.submitTopology(SubmitterMain.java:370)
        at com.twitter.heron.scheduler.SubmitterMain.main(SubmitterMain.java:315)

Exception in thread "main" java.lang.RuntimeException: Failed to submit topology AckingTopology
        at com.twitter.heron.scheduler.SubmitterMain.main(SubmitterMain.java:319)
ERROR: Failed to launch topology 'AckingTopology' because User main failed with status 1. Bailing out...
INFO: Elapsed time: 0.530s.

objmagic added the question label Sep 7, 2016

billonahill commented Sep 7, 2016

What version of Heron are you running? YarnLauncher was added in 0.14.1.

mycFelix commented Sep 9, 2016

Would you like to show your Heron version and more configurations?

kramasamy commented Sep 9, 2016

To show the Heron version, run the following:

cmdline> heron version

kramasamy commented Sep 9, 2016

@ashvina - can you help with this?

HosiYuki commented Sep 14, 2016

@billonahill @mycFelix @kramasamy
Thank you for your replies.
My Heron version was originally 0.14.0. I recently installed 0.14.2, and that error disappeared.

But a new error comes up:

cheng@node18-10:~$ heron submit yarn .heron/examples/heron-examples.jar com.twitter.heron.examples.AckingTopology AckingTopology
INFO: Using config file under /home/cheng/.heron/conf/yarn
INFO: Launching topology 'AckingTopology'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/cheng/.heron/lib/scheduler/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/cheng/.heron/lib/statemgr/heron-zookeeper-statemgr.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Starting client to: 10.107.18.210:1210
2016-09-14 10:25:17,194 INFO [main] imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start(224)) - Starting
2016-09-14 10:25:17,204 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
2016-09-14 10:25:17,205 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:host.name=node18-10.pdl.net
2016-09-14 10:25:17,205 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.version=1.8.0_11
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.vendor=Oracle Corporation
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.home=/home/cheng/jdk1.8.0_11/jre
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.class.path=
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.io.tmpdir=/tmp
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:java.compiler=
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.name=Linux
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.arch=amd64
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:os.version=3.13.0-24-generic
2016-09-14 10:25:17,206 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.name=cheng
2016-09-14 10:25:17,207 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.home=/home/cheng
2016-09-14 10:25:17,207 INFO [main] zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/home/cheng
2016-09-14 10:25:17,208 INFO [main] zookeeper.ZooKeeper (ZooKeeper.java:(438)) - Initiating client connection, connectString=10.107.18.210:1210 sessionTimeout=30000 watcher=org.apache.curator.ConnectionState@4d826d77
2016-09-14 10:25:17,224 INFO [main-SendThread(10.107.18.210:1210)] zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server 10.107.18.210/10.107.18.210:1210. Will not attempt to authenticate using SASL (unknown error)
2016-09-14 10:25:17,232 INFO [main-SendThread(10.107.18.210:1210)] zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to 10.107.18.210/10.107.18.210:1210, initiating session
2016-09-14 10:25:17,259 INFO [main-SendThread(10.107.18.210:1210)] zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server 10.107.18.210/10.107.18.210:1210, sessionid = 0x1572287bedc0016, negotiated timeout = 30000
2016-09-14 10:25:17,267 INFO [main-EventThread] state.ConnectionStateManager (ConnectionStateManager.java:postState(228)) - State change: CONNECTED
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Topologies directory: /heron/topologies
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Tmaster location directory: /heron/tmasters
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Physical plan directory: /heron/pplans
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Execution state directory: /heron/executionstate
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Scheduler location directory: /heron/schedulers
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Closing the CuratorClient to: 10.107.18.210:1210
2016-09-14 10:25:17,291 INFO [main] zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x1572287bedc0016 closed
2016-09-14 10:25:17,291 INFO [main-EventThread] zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
[2016-09-14 10:25:17 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Closing the tunnel processes
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.curator.framework.CuratorFramework.createContainers(Ljava/lang/String;)V
at com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager.initTree(CuratorStateManager.java:130)
at com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager.initialize(CuratorStateManager.java:94)
at com.twitter.heron.scheduler.SubmitterMain.submitTopology(SubmitterMain.java:380)
at com.twitter.heron.scheduler.SubmitterMain.main(SubmitterMain.java:315)
ERROR: Failed to launch topology 'AckingTopology' because User main failed with status 1. Bailing out...
Traceback (most recent call last):
File "/home/cheng/bin/heron/heron/cli/src/python/submit.py", line 145, in launch_topologies
launch_a_topology(cl_args, tmp_dir, topology_file, defn_file)
File "/home/cheng/bin/heron/heron/cli/src/python/submit.py", line 110, in launch_a_topology
java_defines=[]
File "/home/cheng/bin/heron/heron/cli/src/python/execute.py", line 68, in heron_class
raise RuntimeError(err_str)
RuntimeError: User main failed with status 1. Bailing out...
INFO: Elapsed time: 0.861s.

The statemgr.yaml is:

#local state manager class for managing state in a persistent fashion
heron.class.state.manager: com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager
#local state manager connection string
heron.statemgr.connection.string: "10.107.18.210:1210"
#path of the root address to store the state in a local file system
heron.statemgr.root.path: "/heron"
#create the zookeeper nodes, if they do not exist
heron.statemgr.zookeeper.is.initialize.tree: True
#timeout in ms to wait before considering zookeeper session is dead
heron.statemgr.zookeeper.session.timeout.ms: 30000
#timeout in ms to wait before considering zookeeper connection is dead
heron.statemgr.zookeeper.connection.timeout.ms: 30000
#timeout in ms to wait before considering zookeeper connection is dead
heron.statemgr.zookeeper.retry.count: 10
#duration of time to wait until the next retry
heron.statemgr.zookeeper.retry.interval.ms: 10000

And the other config files are defaults. ZooKeeper runs in standalone mode.

mycFelix commented Sep 14, 2016

@HosiYuki - Hi~~~

YarnLauncher was added in 0.14.1 as @billonahill said.

In 0.14.2, you need to copy the hadoop-lib jars to the specified path, following the instructions at http://twitter.github.io/heron/docs/operators/deployment/schedulers/yarn/.

I highly recommend using version 0.14.3. PR #1245 added the --extra-launch-classpath arg, which means you no longer need to copy the hadoop-lib jars to submit a topology in YARN mode once 0.14.3 is released.
The submit command will be:

heron submit yarn ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology --extra-launch-classpath <extra-classpath-value>

No matter which version of Heron you use, there are some things to pay attention to if you want to submit a topology to YARN (Hadoop 2.7.x).

For localfs-statemgr

  • The commons-cli jar's version should be greater than or equal to 1.3.1.

For zookeeper-statemgr

  • The commons-cli jar's version should be greater than or equal to 1.3.1.
  • The curator-framework jar's version should be greater than or equal to 2.9.0.
  • The curator-client jar's version should be greater than or equal to 2.9.0.

Note that the most popular Hadoop version is probably 2.7.x, which ships commons-cli-1.2.1, curator-framework-2.7.1, and curator-client-2.7.1, so these need to be upgraded.

The YARN scheduler doc should be updated for the latest version later.
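
To see which versions your own Hadoop installation ships, a quick check like the following can help (a sketch; it assumes a standard Hadoop tarball layout under $HADOOP_HOME):

# list the commons-cli and curator jars that Hadoop itself puts on the classpath
ls ${HADOOP_HOME}/share/hadoop/common/lib/ | grep -E 'commons-cli|curator'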

ashvina commented Sep 14, 2016

@mycFelix Your answer is complete and accurate. Thanks!
Would you like to contribute this detail to the YARN scheduler doc?

mycFelix commented Sep 14, 2016

@ashvina - I'm afraid my English is a problem... but I'd like to try.

kramasamy commented Sep 15, 2016

@HosiYuki - Does @mycFelix's answer work for you?

HosiYuki commented Sep 19, 2016

@mycFelix Thank you for your enthusiastic help! I've learned a lot.
But I still have some doubts.
1. What should --extra-launch-classpath be set to? The hadoop classpath, or something else? Besides, the --extra-launch-classpath arg couldn't handle strings with '*', as your other issue said. How can I deal with that on the current 0.14.3 version?
2. How do I update the commons-cli and curator jars to a higher version? Is it OK to download higher-version jars and replace the lower ones, or should I update them some other way?
Thanks again for your patient help!

HosiYuki commented Sep 19, 2016

@kramasamy
Thank you for your attention!
His answer helps a lot, but there are still problems getting it to run normally.

kramasamy commented Sep 19, 2016

@HosiYuki - the '*' situation is fixed in master. The commons-cli and curator jars are already baked into other jars. @maosongfu - can you please comment?

mycFelix commented Sep 19, 2016

@HosiYuki - Hi

About the first question:
As @kramasamy said, the '*' situation was fixed by #1373, and the submit command will be:

heron submit yarn ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology --extra-launch-classpath <extra-classpath-value>

The key point is the extra-classpath-value: it should be the path to the hadoop-lib jars if you want to submit a topology to YARN. If you are familiar with your Hadoop working environment, you can put all the jars in a separate directory such as '$hadoop-lib-jars', and the submit command will be:

heron submit yarn ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology --extra-launch-classpath $hadoop-lib-jars 

There is an easier way to build the --extra-launch-classpath value, following these steps:

  • Find your hadoop classpath with the following command:
> hadoop classpath
  • Make sure every single path in it exists.

Then the example submit command in my working environment is:

heron submit yarn ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology \
--extra-launch-classpath /home/hdfs/heron-yarn-classpath-jars/*:${HADOOP_DIR_HOME}/etc/hadoop:${HADOOP_DIR_HOME}/share/hadoop/common/lib/*:${HADOOP_DIR_HOME}/share/hadoop/common/*:${HADOOP_DIR_HOME}/share/hadoop/hdfs:${HADOOP_DIR_HOME}/share/hadoop/hdfs/lib/*:${HADOOP_DIR_HOME}/share/hadoop/hdfs/*:${HADOOP_DIR_HOME}/share/hadoop/yarn/lib/*:${HADOOP_DIR_HOME}/share/hadoop/yarn/*:${HADOOP_DIR_HOME}/share/hadoop/mapreduce/lib/*:${HADOOP_DIR_HOME}/share/hadoop/mapreduce/*

About the second question:
Please note the command I gave above: I downloaded the jars I need, put them in the /home/hdfs/heron-yarn-classpath-jars/ directory, and listed that directory first on the classpath so those jars take precedence.

> ll /home/hdfs/heron-yarn-classpath-jars/*
-rw-r--r-- 1 hdfs hdfs  52988 Jun  14 2015 /home/hdfs/heron-yarn-classpath-jars/commons-cli-1.3.1.jar
-rw-r--r-- 1 hdfs hdfs  71909 Sep  13 14:06 /home/hdfs/heron-yarn-classpath-jars/curator-client-2.9.0.jar
-rw-r--r-- 1 hdfs hdfs 192090 Sep  12 16:57 /home/hdfs/heron-yarn-classpath-jars/curator-framework-2.9.1.jar

With all of the above, I think it should work! Let me know if you have any other questions.
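
A minimal sketch of the submit step under this approach, assuming `hadoop classpath` works on the submitting host and reusing the extra-jars directory from above:

# capture the classpath Hadoop reports for this cluster
HADOOP_CP=$(hadoop classpath)
# put the directory holding the newer commons-cli/curator jars first so they win,
# then pass the whole string to heron submit
heron submit yarn ~/.heron/examples/heron-examples.jar \
  com.twitter.heron.examples.ExclamationTopology ExclamationTopology \
  --extra-launch-classpath "/home/hdfs/heron-yarn-classpath-jars/*:${HADOOP_CP}"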

HosiYuki commented Sep 22, 2016

@mycFelix Thank you for your detailed guidance!
Now I can submit the topology to the YARN cluster and find it through the Hadoop web UI and the Heron UI. But the topology doesn't seem to execute correctly: according to the Heron UI, there is no container for the topology.
On the Heron node there is a process named SubmitterMain, and a REEFLauncher on the AM node. Apart from these, there seems to be no process related to the Heron topology.
Part of the output is as follows:

com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Created node for path: /heron/executionstate/ExclamationTopology
[2016-09-22 08:16:56 +0800] org.apache.hadoop.util.NativeCodeLoader WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[REEF "Powered by" ASCII-art banner]
[2016-09-22 08:16:57 +0800] org.apache.reef.util.REEFVersion INFO: REEF Version: 0.14.0
[2016-09-22 08:16:57 +0800] com.twitter.heron.scheduler.yarn.ReefClientSideHandlers INFO: Initializing REEF client handlers for Heron, topology: ExclamationTopology
[2016-09-22 08:16:57 +0800] org.apache.hadoop.yarn.client.RMProxy INFO: Connecting to ResourceManager at node18-10.pdl.net/10.107.18.210:8032
[2016-09-22 08:17:00 +0800] org.apache.reef.runtime.common.files.JobJarMaker WARNING: Failed to delete [/tmp/reef-job-8823727808098291604]
[2016-09-22 08:17:02 +0800] org.apache.reef.runtime.yarn.client.YarnSubmissionHelper INFO: Submitting REEF Application to YARN. ID: application_1474441925257_0003
[2016-09-22 08:17:02 +0800] org.apache.hadoop.yarn.client.api.impl.YarnClientImpl INFO: Submitted application application_1474441925257_0003
[2016-09-22 08:17:05 +0800] com.twitter.heron.scheduler.yarn.ReefClientSideHandlers INFO: Topology ExclamationTopology is running, jobId ExclamationTopology.
[2016-09-22 08:17:05 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Closing the CuratorClient to: 10.107.18.210:1210
16/09/22 08:17:05 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
16/09/22 08:17:05 INFO zookeeper.ZooKeeper: Session: 0x1573c238041001b closed
16/09/22 08:17:05 INFO zookeeper.ClientCnxn: EventThread shut down
[2016-09-22 08:17:05 +0800] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Closing the tunnel processes

mycFelix commented Sep 22, 2016

@HosiYuki - Would you check the YARN logs, following the instructions here, in the Log File location section?
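
If log aggregation is enabled on the cluster, the aggregated container logs can also be pulled with the YARN CLI (a sketch; the application id is the one from your submit output above):

# fetch driver/evaluator stdout and stderr for the REEF application
yarn logs -applicationId application_1474441925257_0003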

ashvina commented Sep 22, 2016

@HosiYuki as @mycFelix mentioned, logs would be helpful. In addition, check the following:

  1. Are libunwind and Python 2.7 installed? I.e., the topology should be deployable in local mode on all YARN nodes.
  2. If you are using the local state manager (single-node YARN), the root path config must specify an absolute home path, e.g. ${HOME} should be replaced with /home/userDir in heron.statemgr.root.path: ${HOME}/.herondata/repository/state/${CLUSTER}. The default location of the config file is ~/.heron/conf/yarn/statemgr.yaml; a quick check is sketched below.
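
For point 2, a quick sanity check could look like this (a sketch, assuming the default config location):

# make sure the root path in statemgr.yaml is absolute, with no unexpanded
# ${HOME} or ${CLUSTER} placeholders left in it
grep 'heron.statemgr.root.path' ~/.heron/conf/yarn/statemgr.yaml
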
HosiYuki commented Sep 22, 2016

@mycFelix @ashvina
I found driver.stderr and evaluator.stderr, but didn't find the log-files.
Part of driver.stderr is as follows:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "IPC Client (1997548433) connection to node18-10.pdl.net/10.107.18.210:8030 from cheng"
16/09/22 09:05:42 WARN nio.NioEventLoop: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
16/09/22 09:05:42 WARN concurrent.SingleThreadEventExecutor: Unexpected exception from an event executor:
java.lang.OutOfMemoryError: Java heap space
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:478)
at io.netty.util.internal.logging.Slf4JLogger.warn(Slf4JLogger.java:151)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:367)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Exception: java.lang.RuntimeException thrown from the UncaughtExceptionHandler in thread "server-timer"

And part of evaluator.stderr is as follows:

INFO: Entering REEFLauncher.main().
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/avro/io/DatumReader
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2658)
at java.lang.Class.getDeclaredConstructors(Class.java:2007)
at org.apache.reef.tang.util.ReflectionUtilities.getNamedParameterTargetOrNull(ReflectionUtilities.java:311)
at org.apache.reef.tang.implementation.java.ClassHierarchyImpl.buildPathToNode(ClassHierarchyImpl.java:207)
at org.apache.reef.tang.implementation.java.ClassHierarchyImpl.registerClass(ClassHierarchyImpl.java:387)
at org.apache.reef.tang.implementation.java.ClassHierarchyImpl.register(ClassHierarchyImpl.java:331)
at org.apache.reef.tang.implementation.java.ClassHierarchyImpl.getNode(ClassHierarchyImpl.java:257)
at org.apache.reef.tang.implementation.java.InjectorImpl.parseDefaultImplementation(InjectorImpl.java:375)
at org.apache.reef.tang.implementation.java.InjectorImpl.buildInjectionPlan(InjectorImpl.java:449)
at org.apache.reef.tang.implementation.java.InjectorImpl.filterCandidateConstructors(InjectorImpl.java:193)
at org.apache.reef.tang.implementation.java.InjectorImpl.buildClassNodeInjectionPlan(InjectorImpl.java:277)
at org.apache.reef.tang.implementation.java.InjectorImpl.buildInjectionPlan(InjectorImpl.java:452)
at org.apache.reef.tang.implementation.java.InjectorImpl.getInjectionPlan(InjectorImpl.java:472)
at org.apache.reef.tang.implementation.java.InjectorImpl.getInstance(InjectorImpl.java:514)
at org.apache.reef.tang.implementation.java.InjectorImpl.getInstance(InjectorImpl.java:533)
at org.apache.reef.runtime.common.REEFLauncher.getREEFLauncher(REEFLauncher.java:106)
at org.apache.reef.runtime.common.REEFLauncher.main(REEFLauncher.java:167)
Caused by: java.lang.ClassNotFoundException: org.apache.avro.io.DatumReader
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more

Topologies are deployable in local mode, and I use ZooKeeper as the state manager. The statemgr.yaml is as follows:

#local state manager class for managing state in a persistent fashion
heron.class.state.manager: com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager
#local state manager connection string
heron.statemgr.connection.string: "10.107.18.210:1210"
#path of the root address to store the state in a local file system
heron.statemgr.root.path: "/heron"
#create the zookeeper nodes, if they do not exist
heron.statemgr.zookeeper.is.initialize.tree: True
#timeout in ms to wait before considering zookeeper session is dead
heron.statemgr.zookeeper.session.timeout.ms: 30000
#timeout in ms to wait before considering zookeeper connection is dead
heron.statemgr.zookeeper.connection.timeout.ms: 30000
#timeout in ms to wait before considering zookeeper connection is dead
heron.statemgr.zookeeper.retry.count: 10
#duration of time to wait until the next retry
heron.statemgr.zookeeper.retry.interval.ms: 10000

ashvina commented Sep 22, 2016

From the logs: NoClassDefFoundError: org/apache/avro/io/DatumReader. It seems the avro jar needs to be added to --extra-launch-classpath.
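
One way to locate it, sketched under the assumption that the avro jar ships with Hadoop under $HADOOP_HOME as in stock 2.7.x:

# find the avro jar bundled with Hadoop, then append the jar (or its directory
# with /*) to the --extra-launch-classpath value used on submit
find ${HADOOP_HOME}/share/hadoop -name 'avro-*.jar'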

HosiYuki commented Sep 22, 2016

@ashvina
The avro-1.7.4.jar is included. I don't know why org/apache/avro/io/DatumReader can't be found.
I changed the avro jar to a higher version, but it didn't work.

silence-liu commented Sep 23, 2016

[image]

I submitted the topology on Heron and the log shows no error, but I can't see it in the Heron UI and the submit command does not report success. What could be the reason?

mycFelix commented Sep 23, 2016

@silence-liu - Would you please do some more checks?

  1. The first thing to check is whether your topology is running well on YARN. Please check your YARN scheduler website to confirm your applicationId's status (see the sketch after this list).
  2. If your topology is running well on YARN, then we should focus on driver.stderr and evaluator.stderr to make sure there is no error while running. You can follow the instructions at http://twitter.github.io/heron/docs/operators/deployment/schedulers/yarn/, section Log File location.
  3. If steps 1 and 2 are both fine, what you need to do is check your .herontools/conf/heron_tracker.yaml config, following the instructions at http://twitter.github.io/heron/docs/operators/heron-tracker/, to make sure the statemgrs section is set up right.
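
For step 1, a minimal check with the YARN CLI might look like this (a sketch; replace the application id with your own):

# list applications known to the ResourceManager and look for your topology
yarn application -list -appStates ALL
# then inspect the final status and diagnostics of the one you submitted
yarn application -status <applicationId>
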
ashvina commented Sep 24, 2016

@HosiYuki The avro ClassNotFoundError is coming from the evaluator logs, which means the yarn application classpath may be missing in yarn-site.xml. Could you please check that?

A sample yarn-site config is below:

    <property>
        <name>yarn.application.classpath</name>
        <value>
            $HADOOP_HOME/etc/hadoop,
            $HADOOP_HOME/share/hadoop/common/lib/*,
            $HADOOP_HOME/share/hadoop/common/*,
            $HADOOP_HOME/share/hadoop/hdfs,
            $HADOOP_HOME/share/hadoop/hdfs/lib/*,
            $HADOOP_HOME/share/hadoop/hdfs/*,
            $HADOOP_HOME/share/hadoop/yarn/lib/*,
            $HADOOP_HOME/share/hadoop/yarn/*,
            $HADOOP_HOME/share/hadoop/mapreduce/lib/*,
            $HADOOP_HOME/share/hadoop/mapreduce/*,
            $HADOOP_HOME/contrib/capacity-scheduler/*.jar,
            $HADOOP_HOME/share/hadoop/yarn/*,
            $HADOOP_HOME/share/hadoop/yarn/lib/*
        </value>
    </property>
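
To verify the property is actually picked up, a simple check could be the following (a sketch; it assumes the standard etc/hadoop config directory and that the change is propagated to every node):

# confirm yarn.application.classpath is present in the active yarn-site.xml
grep -A 3 'yarn.application.classpath' ${HADOOP_HOME}/etc/hadoop/yarn-site.xml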

ashvina commented Sep 24, 2016

@silence-liu as @mycFelix mentioned, can you share some details from the yarn user logs?
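
If log aggregation is not enabled, the per-container files usually sit under the NodeManager's local log directories (a sketch assuming the default ${HADOOP_HOME}/logs/userlogs location; yarn.nodemanager.log-dirs may point elsewhere on your cluster):

# per-container stdout/stderr on the node that ran the container
ls ${HADOOP_HOME}/logs/userlogs/application_*/container_*/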

mycFelix commented Sep 24, 2016

@ashvina - Shall we add the yarn.application.classpath config to the doc? I'd like to do that.


ashvina commented Sep 24, 2016

I think we should not add this to the Heron doc. It's a very YARN-specific config and is expected to change in future YARN versions.

kramasamy commented Sep 25, 2016

@ashvina - is this config a part of scheduler.yaml?

ashvina commented Sep 25, 2016

@kramasamy The yarn application classpath is part of the YARN config file yarn-site.xml. It is one of many YARN-related configurations, like security, queues, etc. Similarly, many HDFS-related configurations belong to hdfs-site.xml and need not be documented for Heron's YARN scheduler.

ashvina commented Sep 26, 2016

@HosiYuki I have not seen this issue before.

  1. Are you seeing too many orphaned HeronInstances, i.e. instances present even after the YARN application is killed? Do you also see any executor processes?
  2. Could you check the executor and shell log files in the namenode's local directory?

HosiYuki commented Sep 27, 2016

@ashvina

I used the command heron kill yarn ExclamationTopology, and the output shows INFO: Successfully kill topology 'ExclamationTopology'.
But the following processes still exist on each node.

cheng@node18-15:~$ jps
6976 HeronInstance
8064 HeronInstance
9091 HeronInstance
6983 MetricsManager
9671 HeronInstance
8583 MetricsManager
8073 MetricsManager
11019 HeronInstance
9104 HeronInstance
9685 MetricsManager
11032 HeronInstance
5336 DataNode
7321 HeronInstance
6557 HeronInstance
9119 MetricsManager
6180 HeronInstance
7782 HeronInstance
11046 MetricsManager
5482 NodeManager
6955 HeronInstance
8557 HeronInstance
7342 HeronInstance
6578 HeronInstance
7795 HeronInstance
8053 HeronInstance
7350 MetricsManager
6200 HeronInstance
6585 MetricsManager
9658 HeronInstance
8570 HeronInstance
7805 MetricsManager
12350 Jps
6207 MetricsManager
cheng     9076     1  0 10:27 ?        00:00:00 python2.7 ./heron-core/bin/heron-executor 1 ExclamationTopology ExclamationTopology87f09956-f369-42b7-9f26-3b41f7faac72 ExclamationTopology.d
cheng     9090  9076  0 10:27 ?        00:00:00 python2.7 ./heron-core/bin/heron-shell --port=47231 --log_file_prefix=log-files/heron-shell.log
cheng     9091  9076  0 10:27 ?        00:00:06 /home/cheng/jdk1.8.0_11/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxPermSize=128M -XX:PermSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScav
cheng     9101  9076  0 10:27 ?        00:00:02 ./heron-core/bin/heron-stmgr ExclamationTopology ExclamationTopology87f09956-f369-42b7-9f26-3b41f7faac72 ExclamationTopology.defn 10.107.18.2
cheng     9104  9076  0 10:27 ?        00:00:06 /home/cheng/jdk1.8.0_11/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxPermSize=128M -XX:PermSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScav
cheng     9119  9076  0 10:27 ?        00:00:03 /home/cheng/jdk1.8.0_11/bin/java -Xmx1024M -XX:+PrintCommandLineFlags -verbosegc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateS
cheng     9598     1  0 10:28 ?        00:00:00 python2.7 ./heron-core/bin/heron-executor 1 ExclamationTopology ExclamationTopology87f09956-f369-42b7-9f26-3b41f7faac72 ExclamationTopology.d
cheng     9639  9598  0 10:28 ?        00:00:00 python2.7 ./heron-core/bin/heron-shell --port=38657 --log_file_prefix=log-files/heron-shell.log
cheng     9658  9598  0 10:28 ?        00:00:05 /home/cheng/jdk1.8.0_11/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxPermSize=128M -XX:PermSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScav
cheng     9665  9598  0 10:28 ?        00:00:02 ./heron-core/bin/heron-stmgr ExclamationTopology ExclamationTopology87f09956-f369-42b7-9f26-3b41f7faac72 ExclamationTopology.defn 10.107.18.2
cheng     9671  9598  0 10:28 ?        00:00:05 /home/cheng/jdk1.8.0_11/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxPermSize=128M -XX:PermSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScav
cheng     9685  9598  0 10:28 ?        00:00:03 /home/cheng/jdk1.8.0_11/bin/java -Xmx1024M -XX:+PrintCommandLineFlags -verbosegc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateS
root     10349     2  0 10:31 ?        00:00:00 [kworker/u16:2]
cheng    11006     1  0 10:32 ?        00:00:00 python2.7 ./heron-core/bin/heron-executor 1 ExclamationTopology ExclamationTopology87f09956-f369-42b7-9f26-3b41f7faac72 ExclamationTopology.d
cheng    11018 11006  0 10:32 ?        00:00:00 python2.7 ./heron-core/bin/heron-shell --port=56344 --log_file_prefix=log-files/heron-shell.log

Part of hadoop namenode logs:

2016-09-27 10:35:59,297 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/2016_09_27_10_24_13_reef-job-1/reef-evaluator-3160321392980106428.jar is closed by DFSClient_NONMAPREDUCE_-1142593035_1
2016-09-27 10:36:10,321 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741968_1144{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-69aed421-e042-4aa6-84e3-098ac974d44b:NORMAL:10.107.18.215:50010|RBW], ReplicaUC[[DISK]DS-9cc7cded-f003-44c7-be87-f685cd8ad39d:NORMAL:10.107.18.212:50010|RBW]]} for /tmp/2016_09_27_10_24_13_reef-job-1/reef-evaluator-245277802915514085.jar
2016-09-27 10:36:10,343 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.107.18.212:50010 is added to blk_1073741968_1144{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-69aed421-e042-4aa6-84e3-098ac974d44b:NORMAL:10.107.18.215:50010|RBW], ReplicaUC[[DISK]DS-9cc7cded-f003-44c7-be87-f685cd8ad39d:NORMAL:10.107.18.212:50010|RBW]]} size 0
2016-09-27 10:36:10,345 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.107.18.215:50010 is added to blk_1073741968_1144{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-69aed421-e042-4aa6-84e3-098ac974d44b:NORMAL:10.107.18.215:50010|RBW], ReplicaUC[[DISK]DS-9cc7cded-f003-44c7-be87-f685cd8ad39d:NORMAL:10.107.18.212:50010|RBW]]} size 0

Part of yarn resource manager logs:

2016-09-27 10:24:51,545 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1474942797178_0001_000001 released container container_1474942797178_0001_01_000010 on node: host: node18-13.pdl.net:38504 #containers=0 available=<memory:8192, vCores:8> used=<memory:0, vCores:0> with event: RELEASED
2016-09-27 10:24:51,666 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1474942797178_0001_01_000008 Container Transitioned from ACQUIRED to RUNNING
2016-09-27 10:24:52,174 ERROR org.apache.hadoop.yarn.server.webapp.ContainerBlock: Failed to read the container container_1474942797178_0001_01_000002.
java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
    at org.apache.hadoop.yarn.server.webapp.ContainerBlock.render(ContainerBlock.java:77)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
    at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
    at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
    at org.apache.hadoop.yarn.webapp.hamlet.HamletImpl$EImp._v(HamletImpl.java:117)
    at org.apache.hadoop.yarn.webapp.hamlet.Hamlet$TD._(Hamlet.java:845)
    at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:56)
    at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
    at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.container(RmController.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:142)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:595)
    at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:291)
    at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:554)
    at org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1243)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.hadoop.yarn.exceptions.ContainerNotFoundException: Container with id 'container_1474942797178_0001_01_000002' doesn't exist in RM.
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:464)
    at org.apache.hadoop.yarn.server.webapp.ContainerBlock$1.run(ContainerBlock.java:81)
    at org.apache.hadoop.yarn.server.webapp.ContainerBlock$1.run(ContainerBlock.java:78)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    ... 58 more
2016-09-27 10:24:52,667 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1474942797178_0001_01_000011 Container Transitioned from NEW to ALLOCATED

@objmagic objmagic removed the question label Oct 10, 2016

ashvina commented Oct 24, 2016

@HosiYuki I got occupied with other work and lost track of this issue. Are you still observing it? In case you plan to submit a new topology, please first kill any Heron-related YARN applications and Heron processes on all nodes. Command: yarn application -kill yarn_application_id
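
For anyone hitting the same leftover-process problem, here is a minimal cleanup sketch along those lines. The process names come from the jps/ps output earlier in this thread; the pkill patterns are an assumption, so double-check them on your own nodes before running:

# find the Heron-related YARN application id, then kill it
yarn application -list
yarn application -kill <application_id>

# on each node, stop any Heron processes that survived the kill
pkill -f heron-executor
pkill -f heron-stmgr
pkill -f heron-shell
pkill -f HeronInstance
pkill -f MetricsManager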

amirfirouzi commented Dec 26, 2016

Hi guys, I just started to deploy Heron over YARN and ran into some issues, and the answers here will absolutely help!
Just a quick question: am I supposed to install the Heron binaries (client and tools) on all nodes in the YARN cluster (master and slaves), or will installing on the master suffice? I figured that because YarnLauncher uploads heron-core and we submit topologies to the master, installing Heron on the slaves won't be necessary. Is that right?
Thanks

ashvina commented Dec 29, 2016

@amirfirouzi - That is right, installation of Heron binaries on the YARN nodes is not required. You may find additional details here: http://twitter.github.io/heron/docs/operators/deployment/schedulers/yarn/

amirfirouzi commented Dec 30, 2016

I've set up a multi-node Hadoop (v2.7.2) cluster to test Heron (v0.14.5) over a YARN cluster. I've done exactly what is in the Yarn Cluster documentation section, and I also read the answers in this issue, which helped me fix some problems: the needed custom jar files (updated jars), and the yarn & mapreduce classpaths in yarn-site & mapred-site (to fix the avro error) and in the submit command. All of that is done. Here is what I've done and some of my key configs:

  • The YARN cluster is healthy (Hadoop apps run successfully). I installed and configured ZK on the cluster's master node, installed the heron-client and heron-tools binaries on the master, configured the state manager to use CuratorStateManager with connection.string set to localhost:2181, modified heron_tracker.yaml to use ZK, and set the Scheduler, Launcher and Uploader as the doc suggested.

After fixing some configs, when I submit a topology to Heron it seems to be submitted successfully and no errors are logged, and YARN's UI marks the submitted application as SUCCEEDED. But strangely, each topology only runs for a couple of seconds, and when I look at YARN's user logs there is only one container on one of the slaves! In that container the logging stops after "INFO: Launching Heron scheduler". Here is the log generated in driver.stderr on slave1 of my cluster:

OpenJDK 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Dec 28, 2016 12:36:54 PM org.apache.reef.runtime.common.REEFLauncher main
INFO: Entering REEFLauncher.main().
Dec 28, 2016 12:36:54 PM org.apache.reef.util.REEFVersion logVersion
INFO: REEF Version: 0.14.0
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop_data/tmp/nm-local-dir/usercache/hduser/appcache/application_1482914874252_0003/filecache/10/reef-job-3258437929217330603.jar/global/heron-zookeeper-statemgr.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
Dec 28, 2016 12:36:55 PM org.apache.hadoop.yarn.client.RMProxy createRMProxy
INFO: Connecting to ResourceManager at hadoopmaster/192.168.100.100:8050
Dec 28, 2016 12:36:55 PM org.apache.hadoop.yarn.client.RMProxy createRMProxy
INFO: Connecting to ResourceManager at hadoopmaster/192.168.100.100:8030
Dec 28, 2016 12:36:55 PM org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl serviceInit
INFO: Upper bound of the thread pool size is 500
Dec 28, 2016 12:36:55 PM org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy
INFO: yarn.client.max-cached-nodemanagers-proxies : 0
Dec 28, 2016 12:36:56 PM com.twitter.heron.scheduler.yarn.HeronReefUtils extractPackageInSandbox
INFO: Extracting package: reef/global/topology.tar.gz at: .
tar: ExclamationTopology.defn: time stamp 2016-12-28 12:36:59 is 2.692387499 s in the future
tar: ./heron-conf/override.yaml: time stamp 2016-12-28 12:36:59 is 2.692153692 s in the future
Dec 28, 2016 12:36:56 PM com.twitter.heron.scheduler.yarn.HeronReefUtils extractPackageInSandbox
INFO: Extracting package: reef/global/heron-core.tar.gz at: .
Dec 28, 2016 12:36:56 PM com.twitter.heron.scheduler.yarn.HeronMasterDriver$HeronSchedulerLauncher launchScheduler
INFO: Launching Heron scheduler

Here is my submit command:
heron submit yarn ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology --extra-launch-classpath /home/hduser/libs/heron-yarn-classpath-jars/*:/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*

Here is the last part of the output returned in the shell:

[2016-12-28 12:37:01 +0330] org.apache.reef.util.REEFVersion INFO: REEF Version: 0.14.0
[2016-12-28 12:37:01 +0330] com.twitter.heron.scheduler.yarn.ReefClientSideHandlers INFO: Initializing REEF client handlers for Heron, topology: ExclamationTopology
[2016-12-28 12:37:01 +0330] org.apache.hadoop.yarn.client.RMProxy INFO: Connecting to ResourceManager at hadoopmaster/192.168.100.100:8050
[2016-12-28 12:37:10 +0330] org.apache.reef.runtime.common.files.JobJarMaker WARNING: Failed to delete [/tmp/reef-job-2395324188246634929]
[2016-12-28 12:37:11 +0330] org.apache.reef.runtime.yarn.client.YarnSubmissionHelper INFO: Submitting REEF Application to YARN. ID: application_1482914874252_0003
[2016-12-28 12:37:11 +0330] org.apache.hadoop.yarn.client.api.impl.YarnClientImpl INFO: Submitted application application_1482914874252_0003
[2016-12-28 12:37:15 +0330] com.twitter.heron.scheduler.yarn.ReefClientSideHandlers INFO: Topology ExclamationTopology is running, jobId ExclamationTopology.
[2016-12-28 12:37:15 +0330] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Closing the CuratorClient to: 127.0.0.1:2181
16/12/28 12:37:15 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
16/12/28 12:37:15 INFO zookeeper.ZooKeeper: Session: 0x159449b7dd90004 closed
[2016-12-28 12:37:15 +0330] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager INFO: Closing the tunnel processes
16/12/28 12:37:15 INFO zookeeper.ClientCnxn: EventThread shut down
[2016-12-28 12:37:17 +0330] org.apache.reef.runtime.common.client.defaults.DefaultCompletedJobHandler INFO: Job Completed: CompletedJob{'ExclamationTopology'}

And it stops here! It's strange: why doesn't it continue running, and why is only one container allocated? It seems the application is submitted successfully, but before the Heron scheduler gets to schedule containers it somehow stops running. Why does YARN think it succeeded, and how can I make it keep running?

Any help on how to detect and solve the problem? It's driving me crazy.
Thanks

ashvina commented Dec 31, 2016

@amirfirouzi - Once the Heron scheduler starts, all logs are redirected to a Heron log file in the container's local directory. You should find the heron-core binaries, log files, etc. in this directory. Could you please share what you see there?

Regarding additional containers, the YARN cluster must have sufficient resources; otherwise new containers will not be allocated. Please check if this is the case.
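
One quick way to check that, assuming the standard Hadoop 2.x CLI and the default ResourceManager web UI port:

# list NodeManagers and confirm they are RUNNING
yarn node -list -all
# per-node total/used memory and vcores are shown in the RM web UI,
# e.g. http://<resourcemanager-host>:8088/cluster/nodes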

amirfirouzi commented Dec 31, 2016

Dear @ashvina,
Thanks for the response. I've looked on all slaves for this log file, and there is none! There is just one userlogs folder in the container's local dir, and it contains only std.err with the content below. As I said before, the logging stops at "Launching Heron scheduler" (so I think it can't start the scheduler):

OpenJDK 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Dec 31, 2016 11:50:22 AM org.apache.reef.runtime.common.REEFLauncher main
INFO: Entering REEFLauncher.main().
Dec 31, 2016 11:50:22 AM org.apache.reef.util.REEFVersion logVersion
INFO: REEF Version: 0.14.0
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop_data/tmp/nm-local-dir/usercache/hduser/appcache/application_1483172140480_0001/filecache/10/reef-job-6303074797688388694.jar/global/heron-zookeeper-statemgr.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
Dec 31, 2016 11:50:23 AM org.apache.hadoop.yarn.client.RMProxy createRMProxy
INFO: Connecting to ResourceManager at hadoopmaster/192.168.100.100:8050
Dec 31, 2016 11:50:23 AM org.apache.hadoop.yarn.client.RMProxy createRMProxy
INFO: Connecting to ResourceManager at hadoopmaster/192.168.100.100:8030
Dec 31, 2016 11:50:23 AM org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl serviceInit
INFO: Upper bound of the thread pool size is 500
Dec 31, 2016 11:50:23 AM org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy
INFO: yarn.client.max-cached-nodemanagers-proxies : 0
Dec 31, 2016 11:50:23 AM com.twitter.heron.scheduler.yarn.HeronReefUtils extractPackageInSandbox
INFO: Extracting package: reef/global/topology.tar.gz at: .
Dec 31, 2016 11:50:24 AM com.twitter.heron.scheduler.yarn.HeronReefUtils extractPackageInSandbox
INFO: Extracting package: reef/global/heron-core.tar.gz at: .
Dec 31, 2016 11:50:24 AM com.twitter.heron.scheduler.yarn.HeronMasterDriver$HeronSchedulerLauncher launchScheduler
INFO: Launching Heron scheduler

And about resources: I have 1 master and 3 slaves (each with 3GB RAM and 1 CPU core). Isn't that sufficient?

amirfirouzi commented Dec 31, 2016

I tested whether it's deployable on local YARN (1 machine), and it failed just like on the cluster. After reading the YARN logs I found out that it was killing containers because of this option in yarn-site.xml:
yarn.nodemanager.vmem-pmem-ratio (the ratio of virtual memory allowed relative to physical memory — in this case 1GB of physical memory). So I increased the ratio, and also set yarn.scheduler.minimum- and maximum-allocation-mb (and -vcores), as well as yarn.nodemanager.resource.memory-mb and
yarn.nodemanager.resource.cpu-vcores for the NodeManager; after these settings the topologies keep running (see the yarn-site.xml sketch below).
It's worth mentioning that you should check the topology's required resources against the available resources and the min/max allocatable resources as you've set them in YARN. The requested resources (CPU and RAM) must be less than yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores, and of course less than the resources available in the cluster overall:

requested mem for topology < SUM_OF(yarn.nodemanager.resource.memory-mb) (over all slave nodes)
requested cpu for topology < SUM_OF(yarn.nodemanager.resource.cpu-vcores) (over all slave nodes)

So check and rebuild your topology before submitting it to the YARN cluster.
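
For reference, here is a minimal yarn-site.xml sketch with the properties mentioned above. The values are only illustrative, roughly sized for the 3GB / 1-vcore nodes described in this thread; tune them to your own hardware:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>3072</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>3072</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>1</value>
</property>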

amirfirouzi commented Dec 31, 2016

Just one question: after running topologies, is it possible to track the logs and metrics using heron-ui? I ran heron-tracker and heron-ui on the master, and they don't show anything in the UI (topology plans and metrics); I also set heron_tracker.yaml to use ZK.

And where are the Heron logs being saved? I've now tested on the cluster and it succeeds on YARN but stops running, and I want to check the Heron logs to find out what's wrong.
@ashvina, even if I use ZK for state management, will the logs be saved in the slaves' local directories? But where? I don't see any, and I've looked everywhere!
Thanks

@huijunw huijunw added Yarn and removed Yarn labels Jul 23, 2017

yesimsure commented Nov 2, 2017

@amirfirouzi Since it has been a long time, you may have worked it out by now; however, I hope this can help others. To find <NM_LOCAL_DIR> in the second and fourth entries of the Log File Location section here, check the value of ${yarn.nodemanager.local-dirs} in yarn-site.xml. The default value is ${hadoop.tmp.dir}/nm-local-dir, and the default value of ${hadoop.tmp.dir} is /tmp/hadoop-${user.name}. I found my topology logs there.
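
As a concrete illustration with those defaults, the container working directories (and the Heron files inside them) usually live under a path like the one below; the users and application id are placeholders you need to substitute:

ls /tmp/hadoop-<nm_user>/nm-local-dir/usercache/<submitting_user>/appcache/<application_id>/
# each container_* directory in there should contain heron-core/, log-files/,
# driver.stderr / evaluator.stderr, etc.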

yesimsure commented Nov 2, 2017

@silence-liu @amirfirouzi
Following the steps provided by @mycFelix, I can see my topology on heron-ui after step 3. It was indeed a heron-tracker configuration problem for me. Thanks a lot.

  1. The first thing to check is whether your topology is running well on YARN. Please check your YARN scheduler web UI to confirm your application ID's status.

  2. If your topology is running well on YARN, focus on driver.stderr and evaluator.stderr to make sure there is no error at runtime. You can follow the instructions in the Log File Location section of http://twitter.github.io/heron/docs/operators/deployment/schedulers/yarn/

  3. If steps 1 and 2 are both fine, check your .herontools/conf/heron_tracker.yaml configuration, following the instructions on http://twitter.github.io/heron/docs/operators/heron-tracker/, to make sure the statemgrs are set up correctly (see the sketch after this list for one way to verify what the state manager actually holds).
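
For step 3, a quick way to confirm that the submitter actually wrote topology state where the tracker is looking is to list the children of <rootpath>/topologies in ZooKeeper. A hedged Java sketch using Curator follows; the connect string 127.0.0.1:2181 and the rootpath /heron are assumptions and should match whatever your state manager and heron_tracker.yaml declare:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class ListHeronTopologies {
  public static void main(String[] args) throws Exception {
    // Connect to the same ZooKeeper ensemble the state manager writes to
    // and the tracker reads from.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    // A successfully submitted topology appears as a child znode of <rootpath>/topologies.
    for (String topology : client.getChildren().forPath("/heron/topologies")) {
      System.out.println("Registered topology: " + topology);
    }
    client.close();
  }
}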

yesimsure commented Nov 2, 2017

@ashvina @mycFelix
Excuse me, I'm not sure whether my topology submission was successful.
Following the instructions here, I deployed Heron on a single node using YARN as the scheduler; ZooKeeper and Hadoop are single-node as well.

I can see my topology on heron-ui and yarn-ui, and the jps output looks like this:

28608 REEFLauncher
28354 SubmitterMain
2243 MetricsManager
28421 REEFLauncher
872 SecondaryNameNode
618 DataNode
4235 ResourceManager
28685 HeronInstance
4398 NodeManager
4242 Jps
4786 QuorumPeerMain
28727 MetricsManager
28696 HeronInstance
476 NameNode

But the submission is stuck here:

[2017-11-02 22:07:17 +0800] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/topologies/WordCountTopology  
[2017-11-02 22:07:17 +0800] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/WordCountTopology  
[2017-11-02 22:07:17 +0800] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/executionstate/WordCountTopology  
[2017-11-02 22:07:17 +0800] [WARNING] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  
[2017-11-02 22:07:18 +0800] [INFO] org.apache.reef.util.REEFVersion: REEF Version: 0.14.0  
[2017-11-02 22:07:18 +0800] [INFO] com.twitter.heron.scheduler.yarn.ReefClientSideHandlers: Initializing REEF client handlers for Heron, topology: WordCountTopology  
[2017-11-02 22:07:18 +0800] [INFO] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032  
[2017-11-02 22:07:24 +0800] [WARNING] org.apache.reef.runtime.common.files.JobJarMaker: Failed to delete [/tmp/reef-job-5871854936332514150]  
[2017-11-02 22:07:25 +0800] [INFO] org.apache.reef.runtime.yarn.client.YarnSubmissionHelper: Submitting REEF Application to YARN. ID: application_1509597897282_0004  
[2017-11-02 22:07:25 +0800] [INFO] org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1509597897282_0004  
[2017-11-02 22:07:27 +0800] [INFO] com.twitter.heron.scheduler.yarn.ReefClientSideHandlers: Topology WordCountTopology is running, jobId WordCountTopology.  
[2017-11-02 22:07:27 +0800] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the CuratorClient to: 127.0.0.1:2181  
17/11/02 22:07:27 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
17/11/02 22:07:27 INFO zookeeper.ZooKeeper: Session: 0x15f7b0c6f680028 closed
[2017-11-02 22:07:27 +0800] [INFO] com.twitter.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the tunnel processes  
17/11/02 22:07:27 INFO zookeeper.ClientCnxn: EventThread shut down

I'm confused by this.

Also, I can kill the topology successfully from another terminal. About 1 hour later, only 1 MetricsManager process is left. The topology's overview on heron-ui turns red and there is no more data (all metrics are 0) on its detail page.

Is there anything wrong? Grateful for any help.

Contributor

ashvina commented Nov 2, 2017

Hi @yesimsure
The topology submission is successful; the REEF client does not exit on its own. It is fine to terminate the submission client with Ctrl-C once you see the message "Topology WordCountTopology is running".

I am not sure what happened to the topology. Did you see any metrics initially? Could you please check and share the container logs?
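
Besides the container logs, the YARN application report often carries the failure reason once a topology dies. A hedged Java sketch (assuming the Hadoop YARN client jars and a reachable ResourceManager configured via yarn-site.xml on the classpath) that dumps state and diagnostics for every application:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class DumpYarnAppStates {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();
    // Print state and diagnostics for every application the ResourceManager knows about;
    // a failed Heron/REEF application usually records the reason in its diagnostics.
    for (ApplicationReport report : yarn.getApplications()) {
      System.out.println(report.getApplicationId() + " "
          + report.getYarnApplicationState() + " " + report.getDiagnostics());
    }
    yarn.stop();
  }
}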
