

Spark Performance Tests


This is a performance testing framework for Apache Spark 1.0+.

Features

  • Suites of performance tests for Spark, PySpark, Spark Streaming, and MLlib.
  • Parameterized test configurations:
    • Sweeps over sets of parameters to run each test against multiple Spark and test configurations (see the sketch after this list).
  • Automatically downloads and builds Spark:
    • Maintains a cache of successful builds to enable rapid testing against multiple Spark versions.
  • [...]
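
To make the parameter sweep concrete, here is a minimal, self-contained sketch of the idea — this is not spark-perf's actual implementation, and the option names are hypothetical; it only illustrates how every combination of the listed values becomes one test configuration:

    # Toy sweep illustration (not spark-perf code); option names are made up.
    import itertools

    sweep = {
        "num-partitions": [400, 800],
        "random-seed": [5],
        "persist-type": ["memory", "disk"],
    }

    names = sorted(sweep)
    for values in itertools.product(*(sweep[n] for n in names)):
        # Each combination yields one set of command-line arguments,
        # i.e. one test invocation.
        print(" ".join("--%s %s" % (n, v) for n, v in zip(names, values)))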

For questions, bug reports, or feature requests, please open an issue on GitHub.

Coverage

  • Spark Core RDD
    • list coming soon
  • SQL and DataFrames
    • coming soon
  • Machine Learning
    • glm-regression: Generalized Linear Regression Model
    • glm-classification: Generalized Linear Classification Model
    • naive-bayes: Naive Bayes
    • naive-bayes-bernoulli: Bernoulli Naive Bayes
    • decision-tree: Decision Tree
    • als: Alternating Least Squares
    • kmeans: K-Means clustering
    • gmm: Gaussian Mixture Model
    • svd: Singular Value Decomposition
    • pca: Principal Component Analysis
    • summary-statistics: Summary Statistics (min, max, ...)
    • block-matrix-mult: Matrix Multiplication
    • pearson: Pearson's Correlation
    • spearman: Spearman's Correlation
    • chi-sq-feature/gof/mat: Chi-square Tests
    • word2vec: Word2Vec distributed representation of words
    • fp-growth: FP-growth frequent item sets
    • python-glm-classification: Generalized Linear Classification Model
    • python-glm-regression: Generalized Linear Regression Model
    • python-naive-bayes: Naive Bayes
    • python-als: Alternating Least Squares
    • python-kmeans: K-Means clustering
    • python-pearson: Pearson's Correlation
    • python-spearman: Spearman's Correlation

Dependencies

The spark-perf scripts require Python 2.7+. If you're using an earlier version of Python, you may need to install the argparse library using easy_install argparse.
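
For example, a script can guard the import so it fails with a clear message on pre-2.7 interpreters (a generic sketch, not taken from the spark-perf sources):

    import sys
    try:
        import argparse  # in the standard library since Python 2.7
    except ImportError:
        sys.exit("argparse not found; on Python < 2.7 run: easy_install argparse")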

Support for automatically building Spark requires Maven. On spark-ec2 clusters, this can be installed using the ./bin/spark-ec2/install-maven script from this project.

Configuration

To configure spark-perf, copy config/config.py.template to config/config.py and edit that file. See config.py.template for detailed configuration instructions. After editing config.py, execute ./bin/run to run performance tests. You can pass the --config option to use a custom configuration file.
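
Because the configuration is an ordinary Python module, a harness can load whichever file --config points at along these lines — a hypothetical sketch, not the project's actual bin/run code:

    # Hypothetical loader sketch; spark-perf's real bin/run may differ.
    import argparse
    import imp

    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="config/config.py")
    args = parser.parse_args()

    # Execute the chosen file as a module; its top-level names
    # (SCALE_FACTOR, SPARK_TESTS, ...) become attributes of `config`.
    config = imp.load_source("config", args.config)
    print(config.SCALE_FACTOR)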

The following sections describe some additional settings to change for certain test environments:

Running locally

  1. Set up local SSH server/keys such that ssh localhost works on your machine without a password.

  2. Set config.py options that are friendly for local execution:

    SPARK_HOME_DIR = "/path/to/your/spark"
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = 0.05
    SPARK_DRIVER_MEMORY = "512m"
    spark.executor.memory = 2g
    
  3. Uncomment at least one SPARK_TESTS entry (a combined sketch of the resulting config.py follows this list).
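
Putting these steps together, a minimal local config.py might look like the following. The empty SPARK_TESTS list is only a placeholder — keep one of the real entries from config.py.template uncommented instead:

    # Minimal local config.py sketch; values mirror the steps above.
    import socket

    SPARK_HOME_DIR = "/path/to/your/spark"
    SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
    SCALE_FACTOR = 0.05
    SPARK_DRIVER_MEMORY = "512m"

    # Keep at least one SPARK_TESTS entry from the template uncommented;
    # the empty list below is only a placeholder.
    SPARK_TESTS = []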

Running on an existing Spark cluster

  1. SSH into the machine hosting the standalone master.

  2. Set config.py options:

    SPARK_HOME_DIR = "/path/to/your/spark/install"
    SPARK_CLUSTER_URL = "spark://<your-master-hostname>:7077"
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
    
  3. Uncomment at least one SPARK_TESTS entry.

Running on a spark-ec2 cluster with a custom Spark version

  1. Launch an EC2 cluster with Spark's EC2 scripts.

  2. Set config.py options:

    USE_CLUSTER_SPARK = False
    SPARK_COMMIT_ID = <what you want to test>
    SCALE_FACTOR = <depends on your hardware>
    SPARK_DRIVER_MEMORY = <depends on your hardware>
    spark.executor.memory = <depends on your hardware>
    
  3. Uncomment at least one SPARK_TESTS entry.

License

This project is licensed under the Apache 2.0 License. See LICENSE for full license text.
