SparkR Example: Digit Recognition on EC2

Shivaram Venkataraman edited this page Mar 25, 2014 · 3 revisions

SparkR ships with a digit recognition example program. To try it out on EC2, follow the steps below.

Setting up SparkR on an EC2 cluster

You can follow the instructions here. Note that SparkR requires Spark 0.9.0 or later.

The solver program depends on the R package Matrix. To install it on the master and on every slave, run:

cd /root
wget http://cran.cnr.berkeley.edu/src/contrib/Matrix_1.1-2-2.tar.gz
tar xvzf Matrix_1.1-2-2.tar.gz
R CMD INSTALL Matrix
/root/spark-ec2/copy-dir Matrix_1.1-2-2.tar.gz
/root/spark/sbin/slaves.sh R CMD INSTALL ~/Matrix_1.1-2-2.tar.gz
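
Before moving on, it can help to confirm that the package actually loads. The following is a minimal sketch, not part of SparkR; check_r_pkg is a hypothetical helper name, and it assumes Rscript is on the PATH of the machine where it runs (it checks only that machine, so it would have to be repeated on each slave):

```shell
# Hypothetical helper (not part of SparkR or spark-ec2):
# prints "ok" if the named R package loads, "missing" otherwise.
check_r_pkg() {
  if Rscript -e "suppressMessages(library($1))" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "missing"
  fi
}

# Check the package used by the solver on the current machine.
check_r_pkg Matrix
```

If this prints "missing", rerunning the R CMD INSTALL step on that machine is the first thing to try.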

Getting the MNIST training and test data sets

The MNIST data sets are fetched from S3 with s3cmd. Use the following commands to download and configure it:

cd /root 
git clone https://github.com/s3tools/s3cmd.git
cd s3cmd 
./s3cmd --configure

The --configure step prompts you for your AWS credentials and other settings.

Once s3cmd is configured, download the data sets and copy them to the slaves:

./s3cmd get s3://mnist-data/train-mnist-dense-with-labels.data /data/train-mnist-dense-with-labels.data
./s3cmd get s3://mnist-data/test-mnist-dense-with-labels.data /data/test-mnist-dense-with-labels.data 
/root/spark-ec2/copy-dir /data/
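
A download that fails partway can leave a missing or empty file behind, so a quick sanity check may be worthwhile before launching the job. This is a sketch under the assumption that the files were saved to the /data paths used above; check_files is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds only if every given file exists and is non-empty.
check_files() {
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
  done
  echo "all files present"
}

# Check both MNIST files at the paths used in the commands above.
if check_files /data/train-mnist-dense-with-labels.data \
               /data/test-mnist-dense-with-labels.data; then
  echo "data ready"
else
  echo "data incomplete"
fi
```

The check only looks at existence and size; it does not validate the contents of the files.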

If you wish to store the data on ephemeral disks instead of EBS, run /root/ephemeral-hdfs/bin/hadoop fs -copyFromLocal /data/train-mnist-dense-with-labels.data / and change the path passed to textFile() in the example program to the corresponding HDFS path.
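
As an illustration of what the "corresponding HDFS path" might look like, the sketch below builds an HDFS URL from a hostname and a path. It assumes the ephemeral HDFS namenode listens on port 9000, which is not stated on this page, so check fs.default.name in the Hadoop configuration on your cluster before relying on it. hdfs_url is a hypothetical helper name:

```shell
# Hypothetical helper: builds an HDFS URL from a namenode hostname and an HDFS path.
# Assumption: the ephemeral HDFS namenode listens on port 9000.
hdfs_url() {
  printf 'hdfs://%s:9000%s\n' "$1" "$2"
}

# Replace <master_hostname> with the hostname of your master node.
hdfs_url "<master_hostname>" /train-mnist-dense-with-labels.data
```

The resulting string is what would be passed to textFile() in place of the local /data path.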

Launching the linear solver

As the last step, we launch the linear solver program provided in SparkR-pkg/examples:

source /root/spark/conf/spark-env.sh 
cd /root/SparkR-pkg
SPARK_MEM=6g ./sparkR examples/linear_solver_mnist.R `cat ~/spark-ec2/cluster-url`

If you are using an instance type with more memory, you can give the executors more memory by raising SPARK_MEM in the last command. The 6g setting above should work for m1.large with one slave.
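
One rough way to pick a value for other instance types is to reserve a fixed share of the machine's memory for the operating system. The sketch below uses 80% of total memory, a rule of thumb chosen here only because it reproduces the 6g used above for m1.large (7.5 GB, or 7680 MB); it is an assumption, not a recommendation from SparkR. suggest_spark_mem is a hypothetical helper name:

```shell
# Hypothetical helper: suggests a SPARK_MEM value from total memory in MB.
# Assumption: use 80% of total memory, rounded down to whole gigabytes.
suggest_spark_mem() {
  echo "$(( $1 * 80 / 100 / 1024 ))g"
}

# m1.large has 7.5 GB (7680 MB) of memory.
suggest_spark_mem 7680   # prints 6g
```

The result can then be used as SPARK_MEM=<value> in front of the ./sparkR command.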

You can now monitor the job progress using Spark's web UI, at http://<master_hostname>:4040.