ClusterStartupFAQ
Please see the ClusterGuide for the general bigdata configuration and federation start procedure. This page provides some additional tips to help you debug your configuration and get the federation up and running.
My general procedure is to visit the host which will be starting the various misc services, source the installed bigdataenv script ("source .../bin/bigdataenv") to set up the environment, and then run "bigdata start" by hand while the cluster run state is at "status". I then monitor the error log and the console to see whether the misc services start correctly. Once zookeeper and jini are up, the other services can start as well.
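For example, the manual check might look like the sketch below. The /nas mount point is illustrative (substitute the NAS directory chosen during your installation), and the sketch assumes bigdataenv places ${NAS}/bin on the PATH:

```bash
# Assumes the shared volume is mounted at /nas (illustrative; use your own mount point).
source /nas/bin/bigdataenv     # sets ${NAS} and the rest of the bigdata environment

# With the cluster run state still at "status", start this host's services by hand.
bigdata start

# Watch the shared error log for startup problems.
tail -f ${NAS}/error.log
```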
When bringing up a new cluster, it is a good idea to follow this procedure on at least the misc services node(s) and once on a node of each of the other service types (a ClientService node and a DataService node). That way you know that all the different service classes can start correctly. At that point you can change the run state to 'start', and the rest of the nodes should come up unless there are configuration or networking issues with the nodes themselves.
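A minimal sketch of changing the run state, assuming (per the file list below) that the target run state is simply the contents of the ${NAS}/state file:

```bash
# Tell the cluster to bring up the remaining services on all nodes.
# ${NAS}/state is the target run state file described in the list below.
echo "start" > ${NAS}/state
```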
Once the services are starting normally, listServices.sh will report which services are running on which nodes. Compare its output against your configuration plan to verify that all services are running.
The various files mentioned above have the following locations (a quick sanity check follows the list):

* bigdataenv is in ${NAS}/bin
* bigdata is in ${NAS}/bin
* state is ${NAS}/state (this is the target run state file)
* error.log is ${NAS}/error.log
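A quick way to confirm these locations on a fresh install (again assuming an illustrative /nas mount point):

```bash
source /nas/bin/bigdataenv     # illustrative mount point; sets ${NAS}
for f in "${NAS}/bin/bigdataenv" "${NAS}/bin/bigdata" "${NAS}/state" "${NAS}/error.log"; do
    [ -e "$f" ] && echo "ok:      $f" || echo "MISSING: $f"
done
```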
For more detail on how to configure and start a bigdata federation, please see the ClusterGuide.
You may see a stack trace like the following on the console:

```
INFO: exception occured during unicast discovery to 192.168.0.3:4160
with constraints InvocationConstraints[reqs: {}, prefs: {}]
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:519)
        at java.net.Socket.connect(Socket.java:469)
```
This stack trace is normal. Jini logs this message when the registrar discovery lookup fails for a given IP address. However, this is exactly how we test whether jini is running on a host where jini is configured to start: if the lookup fails, then jini SHOULD be started automatically.
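If you want to check by hand whether a registrar is answering, you can probe the unicast discovery port directly. This sketch assumes the default jini discovery port (4160) and that netcat is installed:

```bash
# A refused connection here corresponds to the stack trace above and simply
# means jini is not (yet) running on that host.
nc -z 192.168.0.3 4160 && echo "registrar port open" || echo "registrar port closed"
```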
The build.properties file has some values which are host names. For the build.properties file ONLY, it is critical that the configured value for such a property is exactly the value reported on that host by the *hostname* command. The bigdata and bigdataup shell scripts do exact string comparisons on the hostnames in order to handle some conditional service bootstrapping, and those comparisons will fail if the hostname command reports a different value. The main configuration file is more flexible, since the hostnames it contains are resolved using DNS.
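A simple way to spot mismatches is to compare the hostname command's output with the host-valued properties in build.properties. The grep pattern here is just a heuristic; check the actual property names your build.properties defines:

```bash
hostname                         # the exact string the scripts will compare against
grep -i host build.properties    # heuristic: show the host-valued properties
```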
The jini services can have problems connecting to hosts in the cluster if DNS is not set up correctly. It is generally sufficient to have correct entries in /etc/hosts on each host.
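For example, /etc/hosts entries along these lines on every node (the addresses and host names are illustrative) let each host resolve every other host:

```
192.168.0.3   blade3
192.168.0.4   blade4
192.168.0.5   blade5
```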
Make sure that you have issued the following command on each node. This turns down the baseline "panic" level for the Linux kernel and will let you use all of the RAM on the node without swapping. It DOES NOT disable swapping.
```bash
sysctl -w vm.swappiness=0
```
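The command above takes effect immediately but does not survive a reboot. To make the setting permanent, add it to /etc/sysctl.conf:

```bash
# Persist the setting across reboots, then reload the sysctl configuration.
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p
```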
Also, make sure that you are not assigning too much heap to your Java processes. You need to leave enough memory available for the operating system, for the file system cache, and for the C heap of the JVM itself.
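As a purely illustrative sizing (not a recommendation): on a node with 32GB of RAM, you might cap the Java heap well below physical memory, for example:

```bash
# Illustrative only: leave headroom on a 32GB node for the OS, the file
# system cache, and the JVM's own C heap and direct buffers.
JAVA_OPTS="-Xms24g -Xmx24g"
```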
Make sure that you have enough swap space allocated on the nodes. If there is not enough, the kernel may decide that it cannot commit to supporting further potential allocations and refuse to fork a child process ("Could not allocate memory"). It appears that this can also cause RMI failures under some conditions. You can mitigate this by reducing the size of the direct buffers allocated by bigdata or by adjusting the kernel's overcommit behavior. See http://forums.sun.com/thread.jspa?messageID=9834041#9834041
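To see how much swap is configured and what the current overcommit policy is, you can inspect the node directly:

```bash
swapon -s    # list active swap devices/files and their sizes
free -m      # overall memory and swap usage, in MB

# The kernel's overcommit policy (see the thread linked above):
cat /proc/sys/vm/overcommit_memory    # 0 = heuristic, 1 = always, 2 = strict accounting
```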