Wormhole can be built on both Linux and Mac OS X, and some apps are also tested on Windows. Building wormhole requires `git` and a recent C++ compiler with C++11 support, such as g++ >= 4.8 or clang >= 3.5. Install them on Ubuntu >= 13.10 via:
$ sudo apt-get update && sudo apt-get install -y build-essential git
- Older versions of Ubuntu: install a newer gcc via `ppa:ubuntu-toolchain-r/test` (see the sketch after this list)
- CentOS: install via `devtoolset`
- Mac OS X: either use the `clang` provided by the command line tools or download a pre-compiled gcc from hpc.sourceforge.net
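A minimal sketch for an older Ubuntu, assuming you want g++ 4.8 from the toolchain PPA named above:
# add-apt-repository may require the software-properties-common package
$ sudo add-apt-repository ppa:ubuntu-toolchain-r/test
$ sudo apt-get update && sudo apt-get install -y g++-4.8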
Type `make` to build all apps. The first build may take several minutes because all dependencies, such as `gflags`, are built as well. There are several options for advanced usage:
- `make xgboost` selectively builds xgboost; similarly for linear, difactor, ...
- `make -j4` uses 4 threads for parallel building. For the first build, we suggest building the deps and the apps separately: `make deps -j4 && make -j4`
- `make CXX=g++-4.9` uses a different compiler
- `make DEPS_PATH=your_path` changes the path of the deps. By default all deps are installed under `wormhole/deps`; change the path if they are installed somewhere else.
- `make USE_HDFS=1` enables reading from and writing to HDFS. It requires `libhdfs`, which is often installed with Hadoop. Note that Cloudera appears to ship only a static version of libhdfs, while Hortonworks includes the shared version but not in the `lib/native` folder; use `ldconfig` or similar to point the compiler, linker, and runtime to the correct location (see the sketch after this list).
- `make USE_S3=1` enables reading from and writing to AWS S3. It requires `libcurl4-openssl-dev`, which can be installed via `sudo apt-get install libcurl4-openssl-dev` on Ubuntu.
- `make dmlc=<dmlc_core_path>` is needed to run xgboost in distributed mode on YARN. Combine it with `USE_HDFS=1`.
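A combined build might look like the following sketch; the libhdfs location is an assumption and depends on your Hadoop distribution:
# build the bundled dependencies first, then the apps with HDFS and S3 support
make deps -j4
make -j4 USE_HDFS=1 USE_S3=1
# hypothetical libhdfs location: point the runtime linker at wherever your
# distribution actually installs libhdfs.so
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop/lib/native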
Wormhole runs both on a laptop and on a cluster. A typical command to run an application:
$ tracker/dmlc_xxx.py -n num_workers [-s num_servers] app_bin app_conf
- `tracker/dmlc_xxx.py`: the tracker provided by dmlc-core to launch jobs on various platforms
- `-n`: the number of workers
- `-s`: the number of servers; only required for parameter server applications
- `app_bin`: the binary of the application, which is available under `bin/`
- `app_conf`: the text configuration file specifying the dataset and learning method; see each app's documentation for details
The following command runs linear logistic regression using two workers and a single server on a small dataset:
$ tracker/dmlc_local.py -n 2 -s 1 bin/linear.dmlc learn/linear/guide/demo.conf
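A configuration file is plain text, typically with one key-value pair per line. As a rough, non-authoritative sketch, a file mirroring the xgboost parameters used in the Yarn example below might look like this (the exact keys and syntax differ per app; consult each app's documentation):
data = learn/data/agaricus.txt.train
eval[test] = learn/data/agaricus.txt.test
model_out = mushroom.final.model
nthread = 2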
First make sure the environment variables HADOOP_HOME and JAVA_HOME are set properly. Next compile the Yarn tracker:
$ cd repo/dmlc-core/yarn && ./build.sh
Then a Yarn job can be submitted via tracker/dmlc_yarn.py. For example, the following commands run xgboost on Yarn:
hdfs_path=/your/path
hadoop fs -mkdir ${hdfs_path}/data
hadoop fs -put learn/data/agaricus.txt.train ${hdfs_path}/data
hadoop fs -put learn/data/agaricus.txt.test ${hdfs_path}/data
tracker/dmlc_yarn.py -n 4 --vcores 2 bin/xgboost.dmlc \
learn/xgboost/mushroom.hadoop.conf nthread=2 \
data=hdfs://${hdfs_path}/data/agaricus.txt.train \
eval[test]=hdfs://${hdfs_path}/data/agaricus.txt.test \
model_out=hdfs://${hdfs_path}/mushroom.final.model
Run tracker/dmlc_yarn.py -h
for more details.
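Once the job finishes, the trained model written to `model_out` can be copied back from HDFS with standard Hadoop tooling, for example:
$ hadoop fs -get ${hdfs_path}/mushroom.final.model .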
To launch jobs on a Sun Grid Engine cluster, use tracker/dmlc_sge.py.
Wormhole can also be run over multiple machines via `mpirun`, which is often convenient for a small cluster. Assuming the file `hosts` stores the hostnames of all machines, use:
$ tracker/dmlc_mpi.py -n num_workers -s num_servers -H hosts bin conf
to launch wormhole on these machines. See the next section for an example of setting up a cluster with mpirun.
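For instance, the local demo from earlier could be launched across the machines in `hosts` as sketched below; this assumes the binary, the config, and the data are reachable at the same paths on every machine (e.g. via NFS, as in the next section), and the worker/server counts are illustrative:
$ tracker/dmlc_mpi.py -n 2 -s 1 -H hosts bin/linear.dmlc learn/linear/guide/demo.conf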
In this section we give a tutorial on how to set up a small cluster and launch wormhole jobs on Amazon EC2.
- Assume all data are stored on Amazon S3.
- Use a mid-range instance as the master node to build wormhole and submit jobs, and several high-end instances to do the computation.
- Use NFS to dispatch binaries and configurations, and `mpirun` to launch jobs.
First launch an Ubuntu 14.04 instance as the master node. It is mainly used for compiling code, so a mid-range instance such as c4.xlarge is often good enough. Install the required libraries via:
$ sudo apt-get update && sudo apt-get install -y build-essential git libcurl4-openssl-dev
Then build wormhole with S3 support:
$ git clone https://github.com/dmlc/wormhole.git
$ cd wormhole && make deps -j4 && make -j4 USE_S3=1
Next setup NFS:
$ sudo apt-get install nfs-kernel-server mpich2
$ echo "/home/ubuntu/ *(rw,sync,no_subtree_check)" | sudo tee /etc/exports
$ sudo service nfs-kernel-server start
Finally, copy the pem file used to access the master node to the master node's ~/.ssh/id_rsa so that this node can access all the other machines.
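A minimal sketch, assuming the key pair is called mykey.pem and the master's address is in the (hypothetical) variable $MASTER:
$ scp -i mykey.pem mykey.pem ubuntu@$MASTER:.ssh/id_rsa
$ ssh -i mykey.pem ubuntu@$MASTER 'chmod 600 ~/.ssh/id_rsa'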
Next, launch several Ubuntu 12.04 instances with the same pem file as the slave nodes. High-end instances such as c4.4xlarge and c4.8xlarge are recommended. Save their private IPs in the file `hosts`:
$ cat hosts
172.30.0.172
172.30.0.171
172.30.0.170
Then install both NFS and mpirun on these slave nodes. Assume the master node has the private IP 172.30.0.160:
while read h; do
echo $h
ssh -o StrictHostKeyChecking=no $h <<'ENDSSH'
sudo apt-get update
sudo apt-get install -y nfs-common mpich2
sudo mount 172.30.0.160:/home/ubuntu /home/ubuntu
ENDSSH
done <hosts
Next install the required libraries on all slave nodes:
$ mpirun -hostfile hosts sudo apt-get install -y build-essential libcurl4-openssl-dev
Test if everything is OK:
$ mpirun -hostfile hosts uname -a
$ mpirun -hostfile hosts ldd wormhole/bin/linear.dmlc
Now we can submit jobs from the master node via:
$ wormhole/tracker/dmlc_mpi.py -n ? -s ? -H hosts wormhole/bin/? ?.conf
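For example, a hypothetical run of the linear app with 3 workers and 1 server (the config file and the S3 paths inside it are placeholders you would supply):
# linear.conf is a hypothetical config whose data paths point at s3:// locations
$ wormhole/tracker/dmlc_mpi.py -n 3 -s 1 -H hosts wormhole/bin/linear.dmlc linear.conf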