Simplifying robust end-to-end machine learning on Apache Spark.

KeystoneML

The biggest, baddest pipelines around.

Example pipeline

Build the KeystoneML project

./sbt/sbt assembly
make # This builds the native libraries used in KeystoneML

Example: MNIST pipeline

# Get the data from S3
wget http://mnist-data.s3.amazonaws.com/train-mnist-dense-with-labels.data
wget http://mnist-data.s3.amazonaws.com/test-mnist-dense-with-labels.data

KEYSTONE_MEM=4g ./bin/run-pipeline.sh \
  keystoneml.pipelines.images.mnist.MnistRandomFFT \
  --trainLocation ./train-mnist-dense-with-labels.data \
  --testLocation ./test-mnist-dense-with-labels.data \
  --numFFTs 4 \
  --blockSize 2048
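A failed or partial download shows up only once the pipeline tries to read its input, so it can help to check the two data files first. The following is a minimal sketch, not part of KeystoneML: `require_data_file` is a hypothetical helper name, and it only checks that each file exists and is non-empty.

```shell
# Hypothetical helper (not shipped with KeystoneML): succeed only if the
# given path is a regular file with a size greater than zero.
require_data_file() {
  if [ ! -f "$1" ]; then
    echo "missing data file: $1" >&2
    return 1
  fi
  if [ ! -s "$1" ]; then
    echo "empty data file: $1" >&2
    return 1
  fi
  return 0
}

# Example use before launching the MNIST pipeline:
#   require_data_file ./train-mnist-dense-with-labels.data &&
#   require_data_file ./test-mnist-dense-with-labels.data
```

The check says nothing about whether the contents are valid, only that `wget` left something on disk.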

Running with spark-submit

To run KeystoneML pipelines on large datasets, you will need a Spark cluster. Pipelines are launched on the cluster with spark-submit.

Export SPARK_HOME before running KeystoneML with spark-submit. Once it is set, launch your pipeline with run-pipeline.sh in the same way as in the MNIST example above.

export SPARK_HOME=~/spark-1.3.1-bin-cdh4 # should match the version keystone is built with
KEYSTONE_MEM=4g ./bin/run-pipeline.sh \
  keystoneml.pipelines.images.mnist.MnistRandomFFT \
  --trainLocation ./train-mnist-dense-with-labels.data \
  --testLocation ./test-mnist-dense-with-labels.data \
  --numFFTs 4 \
  --blockSize 2048
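Since the launch depends on SPARK_HOME, a quick sanity check before calling run-pipeline.sh can catch an unset or mistyped path early. This is a sketch under assumptions: `check_spark_home` is a hypothetical helper, and it relies only on the standard Spark distribution layout, where `spark-submit` lives under `$SPARK_HOME/bin`. It does not verify that the Spark version matches the one KeystoneML was built with.

```shell
# Hypothetical helper (not part of KeystoneML): succeed only if SPARK_HOME
# is set and contains an executable bin/spark-submit.
check_spark_home() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "${SPARK_HOME}/bin/spark-submit" ]; then
    echo "no executable spark-submit under ${SPARK_HOME}/bin" >&2
    return 1
  fi
  return 0
}

# Example use:
#   check_spark_home && KEYSTONE_MEM=4g ./bin/run-pipeline.sh ...
```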