Spark Performance Tests


This is a performance testing framework for Apache Spark 1.0+.

Features

  • Suites of performance tests for Spark, PySpark, Spark Streaming, and MLlib.
  • Parameterized test configurations:
    • Sweeps sets of parameters to test against multiple Spark and test configurations (see the sketch after this list).
  • Automatically downloads and builds Spark:
    • Maintains a cache of successful builds to enable rapid testing against multiple Spark versions.
  • [...]
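
For illustration, a parameter sweep is expressed in the config file by listing several values for an option; the framework then runs the test once per combination. The fragment below is only a sketch: the OptionSet/JavaOptionSet helpers and the import path are assumptions based on config.py.template, which remains the authoritative reference.

    # Sketch of a parameterized sweep in config.py (assumed API; see config.py.template).
    from sparkperf.config_utils import OptionSet, JavaOptionSet  # import path assumed

    # Test-level options: each value in the list becomes its own run.
    SWEEP_OPTS = [
        OptionSet("num-partitions", [200, 400, 800], can_scale=True),
        OptionSet("num-trials", [10]),
    ]

    # Spark-level options are swept the same way (names and values illustrative).
    SWEEP_JAVA_OPTS = [
        JavaOptionSet("spark.executor.memory", ["2g", "4g"]),
    ]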

For questions, bug reports, or feature requests, please open an issue on GitHub.

Test coverage

  • Spark Core RDD
    • list coming soon
  • SQL and DataFrames
    • coming soon
  • Machine Learning
    • glm-regression: Generalized Linear Regression Model
    • glm-classification: Generalized Linear Classification Model
    • naive-bayes: Naive Bayes
    • naive-bayes-bernoulli: Bernoulli Naive Bayes
    • decision-tree: Decision Tree
    • als: Alternating Least Squares
    • kmeans: K-Means clustering
    • gmm: Gaussian Mixture Model
    • svd: Singular Value Decomposition
    • pca: Principal Component Analysis
    • summary-statistics: Summary Statistics (min, max, ...)
    • block-matrix-mult: Matrix Multiplication
    • pearson: Pearson's Correlation
    • spearman: Spearman's Correlation
    • chi-sq-feature/gof/mat: Chi-square Tests
    • word2vec: Word2Vec distributed representation of words
    • fp-growth: FP-growth frequent item sets
    • python-glm-classification: Generalized Linear Classification Model
    • python-glm-regression: Generalized Linear Regression Model
    • python-naive-bayes: Naive Bayes
    • python-als: Alternating Least Squares
    • python-kmeans: K-Means clustering
    • python-pearson: Pearson's Correlation
    • python-spearman: Spearman's Correlation

Dependencies

The spark-perf scripts require Python 2.7+. If you're using an earlier version of Python, you may need to install the argparse library using easy_install argparse.

Support for automatically building Spark requires Maven. On spark-ec2 clusters, this can be installed using the ./bin/spark-ec2/install-maven script from this project.

Configuration and Running

To configure spark-perf, copy config/config.py.template to config/config.py and edit that file. See config.py.template for detailed configuration instructions. After editing, execute ./bin/run to run performance tests. You can pass the --config option to use a custom configuration file.

The following sections describe some additional settings to change for certain test environments:

Running locally

  1. Set up local SSH server/keys such that ssh localhost works on your machine without a password.

  2. Set options that are friendly for local execution:

    SPARK_HOME_DIR = /path/to/your/spark
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = .05
    spark.executor.memory = 2g
  3. Uncomment at least one SPARK_TESTS entry (a combined sketch of these settings follows).
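
Put together, the local settings in config.py might look like the sketch below. The values are examples only, and the way spark.executor.memory is applied (through the per-suite Java option lists) is an assumption based on config.py.template:

    # Illustrative local settings for config.py (example values only).
    import socket

    SPARK_HOME_DIR = "/path/to/your/spark"
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = 0.05              # shrink data sizes so tests finish quickly on one machine
    SPARK_DRIVER_MEMORY = "512m"     # example value; a small driver is enough at this scale

    # Executor memory is set through the Spark option lists in config.py.template, e.g.:
    # JavaOptionSet("spark.executor.memory", ["2g"])

    # Then uncomment at least one entry in SPARK_TESTS (or MLLIB_TESTS, PYSPARK_TESTS, ...).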

Running on an existing Spark cluster

  1. SSH into the machine hosting the standalone master.

  2. Set options:

    SPARK_HOME_DIR = /path/to/your/spark/install
    SPARK_CLUSTER_URL = "spark://<your-master-hostname>:7077"
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
  3. Uncomment at least one SPARK_TESTS entry (see the example below).
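
For illustration only (memory sizes depend entirely on your hardware), the corresponding lines in config.py could look like this:

    # Illustrative settings for an existing standalone cluster (sizes are examples only).
    SPARK_HOME_DIR = "/root/spark"
    SPARK_CLUSTER_URL = "spark://my-master-hostname:7077"   # hard-code your master's hostname
    SCALE_FACTOR = 1.0                                      # example only
    SPARK_DRIVER_MEMORY = "4g"                              # example only

    # Executor memory, again via the Java option lists, e.g.:
    # JavaOptionSet("spark.executor.memory", ["8g"])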

Running on a spark-ec2 cluster with a custom Spark version

  1. Launch an EC2 cluster with Spark's EC2 scripts.

  2. Set options:

    SPARK_COMMIT_ID = <what you want to test>
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
  3. Uncomment at least one SPARK_TESTS entry (see the sketch below).
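
A sketch of the relevant settings follows. USE_CLUSTER_SPARK is an assumption about the flag in config.py.template that switches from the cluster's preinstalled Spark to a freshly built one; the other values are examples only:

    # Illustrative settings for building and testing a specific Spark commit on spark-ec2.
    USE_CLUSTER_SPARK = False        # assumed flag: build Spark rather than use the preinstalled one
    SPARK_COMMIT_ID = "v1.6.0"       # example: any tag, branch, or commit hash to test
    SCALE_FACTOR = 1.0               # example only
    SPARK_DRIVER_MEMORY = "4g"       # example only

    # Executor memory, as before:
    # JavaOptionSet("spark.executor.memory", ["8g"])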

License

This project is licensed under the Apache 2.0 License. See LICENSE for full license text.

