Spark Performance Tests

This is a performance testing framework for Apache Spark 1.0+.

Features

  • Suites of performance tests for Spark, PySpark, Spark Streaming, and MLlib.
  • Parameterized test configurations:
    • Sweeps over sets of parameters to test against multiple Spark versions and test configurations (see the sketch after this list).
  • Automatically downloads and builds Spark:
    • Maintains a cache of successful builds to enable rapid testing against multiple Spark versions.
  • [...]
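
To give a flavor of how the sweeps work: config.py describes each test's options as lists of candidate values, and the framework runs the cross-product of those lists. A minimal sketch, assuming the OptionSet and JavaOptionSet helpers from config.py.template (the option names and values below are placeholders, not a complete configuration):

    # Each OptionSet holds the candidate values for one flag; the framework
    # runs every combination of values across all OptionSets.
    COMMON_JAVA_OPTS = [
        JavaOptionSet("spark.executor.memory", ["2g", "4g"]),
    ]
    SPARK_KEY_VAL_TEST_OPTS = [
        OptionSet("num-trials", [10]),
        OptionSet("num-partitions", [400, 800]),
    ]
    # Two memory settings x two partition counts = 4 runs per enabled test.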

For questions, bug reports, or feature requests, please open an issue on GitHub.

Coverage

  • Spark Core RDD
    • list coming soon
  • SQL and DataFrames
    • coming soon
  • Machine Learning
    • glm-regression: Generalized Linear Regression Model
    • glm-classification: Generalized Linear Classification Model
    • naive-bayes: Naive Bayes
    • naive-bayes-bernoulli: Bernoulli Naive Bayes
    • decision-tree: Decision Tree
    • als: Alternating Least Squares
    • kmeans: K-Means clustering
    • gmm: Gaussian Mixture Model
    • svd: Singular Value Decomposition
    • pca: Principal Component Analysis
    • summary-statistics: Summary Statistics (min, max, ...)
    • block-matrix-mult: Matrix Multiplication
    • pearson: Pearson's Correlation
    • spearman: Spearman's Correlation
    • chi-sq-feature/gof/mat: Chi-square Tests
    • word2vec: Word2Vec distributed representation of words
    • fp-growth: FP-growth frequent item sets
    • python-glm-classification: Generalized Linear Classification Model
    • python-glm-regression: Generalized Linear Regression Model
    • python-naive-bayes: Naive Bayes
    • python-als: Alternating Least Squares
    • python-kmeans: K-Means clustering
    • python-pearson: Pearson's Correlation
    • python-spearman: Spearman's Correlation

Dependencies

The spark-perf scripts require Python 2.7+. If you're using an earlier version of Python, you may need to install the argparse library using easy_install argparse.
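
If you are unsure which Python will run the scripts, the following snippet (a quick standalone check, not part of spark-perf) confirms the interpreter version and whether argparse is available:

    import sys

    # spark-perf needs Python 2.7+; argparse ships with 2.7 but not with 2.6.
    print(sys.version)
    try:
        import argparse  # imported only to test availability
    except ImportError:
        sys.exit("argparse is missing; install it with: easy_install argparse")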

Support for automatically building Spark requires Maven. On spark-ec2 clusters, this can be installed using the ./bin/spark-ec2/install-maven script from this project.

Configuration

To configure spark-perf, copy config/config.py.template to config/config.py and edit that file. See config.py.template for detailed configuration instructions. After editing config.py, execute ./bin/run to run performance tests. You can pass the --config option to use a custom configuration file.

The following sections describe some additional settings to change for certain test environments:

Running locally

  1. Set up a local SSH server and keys so that ssh localhost works on your machine without a password.

  2. Set config.py options that are friendly for local execution:

    SPARK_HOME_DIR = "/path/to/your/spark"
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = 0.05
    SPARK_DRIVER_MEMORY = "512m"
    spark.executor.memory = 2g  # Spark property; set via the option sets in config.py
    
  3. Uncomment at least one SPARK_TESTS entry.
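
For reference, an enabled test entry has roughly the following shape. This is an illustrative sketch only; the real, commented-out entries in config.py.template define the exact runner and option-list names, so uncomment one of those rather than typing an entry from scratch:

    SPARK_TESTS += [
        ("scheduling-throughput",           # test name
         SPARK_PERF_TEST_RUNNER,            # runner defined in the template
         SCALE_FACTOR,                      # scales data size to your cluster
         COMMON_JAVA_OPTS,                  # JVM/Spark property sweeps
         [ConstantOption("scheduling-throughput")] + SPARK_KEY_VAL_TEST_OPTS),
    ]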

Running on an existing Spark cluster

  1. SSH into the machine hosting the standalone master

  2. Set config.py options:

    SPARK_HOME_DIR = "/path/to/your/spark/install"
    SPARK_CLUSTER_URL = "spark://<your-master-hostname>:7077"
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
    
  3. Uncomment at least one SPARK_TESTS entry.

Running on a spark-ec2 cluster with a custom Spark version

  1. Launch an EC2 cluster with Spark's EC2 scripts.

  2. Set config.py options:

    USE_CLUSTER_SPARK = False
    SPARK_COMMIT_ID = <what you want to test>
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
    
  3. Uncomment at least one SPARK_TESTS entry.

License

This project is licensed under the Apache 2.0 License. See LICENSE for full license text.