Spark Performance Tests

This is a performance testing framework for Apache Spark 1.0+.

Features

  • Suites of performance tests for Spark, PySpark, Spark Streaming, and MLlib.
  • Parameterized test configurations:
    • Sweeps over sets of parameters to test against multiple Spark versions and test configurations (see the sketch after this list).
  • Automatically downloads and builds Spark:
    • Maintains a cache of successful builds to enable rapid testing against multiple Spark versions.
  • [...]
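
To give a flavor of how the sweeps work: config.py describes each test's options as lists of candidate values, and the framework runs the cross-product of those lists. A minimal sketch, assuming the OptionSet and JavaOptionSet helpers from config.py.template (the option names and values below are placeholders, not a complete configuration):

    # Each OptionSet holds the candidate values for one flag; the framework
    # runs every combination of values across all OptionSets.
    COMMON_JAVA_OPTS = [
        JavaOptionSet("spark.executor.memory", ["2g", "4g"]),
    ]
    SPARK_KEY_VAL_TEST_OPTS = [
        OptionSet("num-trials", [10]),
        OptionSet("num-partitions", [400, 800]),
    ]
    # Two memory settings x two partition counts = 4 runs per enabled test.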

For questions, bug reports, or feature requests, please open an issue on GitHub.

Coverage

  • Spark Core RDD
    • list coming soon
  • SQL and DataFrames
    • coming soon
  • Machine Learning
    • glm-regression: Generalized Linear Regression Model
    • glm-classification: Generalized Linear Classification Model
    • naive-bayes: Naive Bayes
    • naive-bayes-bernoulli: Bernoulli Naive Bayes
    • decision-tree: Decision Tree
    • als: Alternating Least Squares
    • kmeans: K-Means clustering
    • gmm: Gaussian Mixture Model
    • svd: Singular Value Decomposition
    • pca: Principal Component Analysis
    • summary-statistics: Summary Statistics (min, max, ...)
    • block-matrix-mult: Matrix Multiplication
    • pearson: Pearson's Correlation
    • spearman: Spearman's Correlation
    • chi-sq-feature/gof/mat: Chi-square Tests
    • word2vec: Word2Vec distributed representation of words
    • fp-growth: FP-growth frequent item sets
    • python-glm-classification: Generalized Linear Classification Model
    • python-glm-regression: Generalized Linear Regression Model
    • python-naive-bayes: Naive Bayes
    • python-als: Alternating Least Squares
    • python-kmeans: K-Means clustering
    • python-pearson: Pearson's Correlation
    • python-spearman: Spearman's Correlation

Dependencies

The spark-perf scripts require Python 2.7+. If you're using an earlier version of Python, you may need to install the argparse library using easy_install argparse.
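
If you are unsure which Python will run the scripts, the following snippet (a quick standalone check, not part of spark-perf) confirms the interpreter version and whether argparse is available:

    import sys

    # spark-perf needs Python 2.7+; argparse ships with 2.7 but not with 2.6.
    print(sys.version)
    try:
        import argparse  # imported only to test availability
    except ImportError:
        sys.exit("argparse is missing; install it with: easy_install argparse")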

Support for automatically building Spark requires Maven. On spark-ec2 clusters, this can be installed using the ./bin/spark-ec2/install-maven script from this project.

Configuration

To configure spark-perf, copy config/config.py.template to config/config.py and edit that file. See config.py.template for detailed configuration instructions. After editing config.py, execute ./bin/run to run performance tests. You can pass the --config option to use a custom configuration file.

The following sections describe some additional settings to change for certain test environments:

Running locally

  1. Set up a local SSH server and keys so that ssh localhost works on your machine without a password.

  2. Set config.py options that are friendly for local execution:

    SPARK_HOME_DIR = "/path/to/your/spark"
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = 0.05
    SPARK_DRIVER_MEMORY = "512m"
    spark.executor.memory = 2g  # Spark property; set via the option sets in config.py
    
  3. Uncomment at least one SPARK_TESTS entry.
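
For reference, an enabled test entry has roughly the following shape. This is an illustrative sketch only; the real, commented-out entries in config.py.template define the exact runner and option-list names, so uncomment one of those rather than typing an entry from scratch:

    SPARK_TESTS += [
        ("scheduling-throughput",           # test name
         SPARK_PERF_TEST_RUNNER,            # runner defined in the template
         SCALE_FACTOR,                      # scales data size to your cluster
         COMMON_JAVA_OPTS,                  # JVM/Spark property sweeps
         [ConstantOption("scheduling-throughput")] + SPARK_KEY_VAL_TEST_OPTS),
    ]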

Running on an existing Spark cluster

  1. SSH into the machine hosting the standalone master

  2. Set config.py options:

    SPARK_HOME_DIR = "/path/to/your/spark/install"
    SPARK_CLUSTER_URL = "spark://<your-master-hostname>:7077"
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
    
  3. Uncomment at least one SPARK_TESTS entry.

Running on a spark-ec2 cluster with a custom Spark version

  1. Launch an EC2 cluster with Spark's EC2 scripts.

  2. Set config.py options:

    USE_CLUSTER_SPARK = False
    SPARK_COMMIT_ID = <what you want to test>
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
    
  3. Uncomment at least one SPARK_TESTS entry.

License

This project is licensed under the Apache 2.0 License. See LICENSE for full license text.