- Title: Unit Testing for Spark
- Slug: spark-unit-test
- Date: 2019-11-26
- Category: Computer Science
- Tags: programming, Scala, Spark, unit testing, unit test
- Author: Ben Du

## Static Analyzer

If we get the execution plan,
then it is quite easy to analyze ...


https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-lineage.html

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-dependencies.html

http://hydronitrogen.com/in-the-code-spark-sql-query-planning-and-execution.html
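As a rough illustration of getting hold of the execution plan, the sketch below prints the Catalyst plans of a small DataFrame and the lineage of a small RDD. The object name `PlanInspection`, its two helper functions, and the sample data are made up for this example; only `queryExecution`, `explain` and `toDebugString` come from Spark itself. It assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, used only for this illustration.
object PlanInspection {
  // Text form of the optimized logical plan produced by Catalyst.
  def optimizedPlan(df: DataFrame): String =
    df.queryExecution.optimizedPlan.treeString

  // Text form of the RDD lineage (the chain of parent RDDs).
  def lineage(rdd: RDD[_]): String = rdd.toDebugString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("PlanInspection")
      .getOrCreate()
    import spark.implicits._
    try {
      // Made-up sample data.
      val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
        .filter($"id" > 1)
        .select("label")
      println(optimizedPlan(df))
      // Prints the parsed, analyzed, optimized and physical plans.
      df.explain(true)

      val rdd = spark.sparkContext.parallelize(1 to 10).map(_ * 2).filter(_ % 3 == 0)
      println(lineage(rdd))
    } finally {
      spark.stop()
    }
  }
}
```

Because both helpers return plain strings, a static check could in principle search them for patterns of interest (for example, an unexpected join strategy), though how far that scales is an open question.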

## Spark Testing Frameworks/Tools

You can use a general Scala testing framework such as ScalaTest (recommended) or specs2,
or you can use frameworks/tools built on top of them specifically for Spark.
Various discussions suggest that **Spark Testing Base** is a good one.

https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau

### Spark Unit Testing

1. [Spark Testing Base](https://github.com/holdenk/spark-testing-base)

2. [sscheck](https://github.com/juanrh/sscheck)
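For a feel of what a test with Spark Testing Base can look like, here is a minimal sketch. It assumes spark-testing-base and ScalaTest 3.1 or later (for `AnyFunSuite`) are test dependencies; the `WordCount` object and the suite name are hypothetical. `SharedSparkContext` is the trait from Spark Testing Base that provides a local `sc` shared by the tests in a suite.

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.rdd.RDD
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical code under test: a plain function from RDD to RDD.
object WordCount {
  def count(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}

// The suite mixes in SharedSparkContext, which supplies `sc`.
class WordCountSuite extends AnyFunSuite with SharedSparkContext {
  test("counts each word across lines") {
    val result = WordCount.count(sc.parallelize(Seq("a b", "b c b"))).collectAsMap()
    assert(result == Map("a" -> 1, "b" -> 3, "c" -> 1))
  }

  test("an empty RDD yields no counts") {
    assert(WordCount.count(sc.emptyRDD[String]).isEmpty())
  }
}
```

Keeping the logic in a function that takes and returns RDDs means the test only has to supply small in-memory inputs.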

### Spark Performance Test

https://github.com/databricks/spark-perf

### Spark Integration Test

https://github.com/databricks/spark-integration-tests

### Spark Job Validation

https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau

QuickCheck/ScalaCheck 

1. QuickCheck generates test data under a set of constraints.
2. The Scala version is ScalaCheck, which is supported by the two unit testing libraries for Spark:
    - sscheck
        + Awesome people
        + supports generating DStreams too!
    - spark-testing-base
        + Awesome people
        + generates more pathological RDDs (e.g., with empty partitions)
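To make the idea of constraint-based test data concrete, here is a small sketch using plain ScalaCheck, without the Spark-specific generators from sscheck or spark-testing-base. The transformation `maxPerKey`, the generator `genRecords`, and the two properties are invented for this illustration; it assumes ScalaCheck is on the classpath.

```scala
import org.scalacheck.{Gen, Prop, Test}

// Hypothetical example, not taken from sscheck or spark-testing-base.
object DedupProperties {
  // Transformation under test: keep the largest value seen for each key.
  def maxPerKey(records: Seq[(Int, Int)]): Map[Int, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).max }

  // Constraint on the data: few distinct keys, so duplicates are likely;
  // the generated list may also be empty.
  val genRecords: Gen[List[(Int, Int)]] =
    Gen.listOf(Gen.zip(Gen.choose(0, 5), Gen.choose(-100, 100)))

  // Property 1: no key is lost and none is invented.
  val keysPreserved: Prop = Prop.forAll(genRecords) { rs =>
    maxPerKey(rs).keySet == rs.map(_._1).toSet
  }

  // Property 2: the retained value is at least as large as every input value.
  val valueIsMax: Prop = Prop.forAll(genRecords) { rs =>
    val out = maxPerKey(rs)
    rs.forall { case (k, v) => out(k) >= v }
  }

  def main(args: Array[String]): Unit =
    Seq("keysPreserved" -> keysPreserved, "valueIsMax" -> valueIsMax).foreach {
      case (name, prop) =>
        val result = Test.check(Test.Parameters.default, prop)
        println(s"$name passed: ${result.passed}")
    }
}
```

With Spark in the picture, the same generated lists could be turned into RDDs via `parallelize`, or the libraries above could be used to generate RDDs directly.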

## Testing Spark Applications

### Good Discussions

http://blog.ippon.tech/testing-strategy-apache-spark-jobs/

http://blog.ippon.tech/testing-strategy-for-spark-streaming/

https://www.youtube.com/watch?v=rOQEiTXNS0g

https://www.slideshare.net/SparkSummit/beyond-parallelize-and-collect-by-holden-karau

https://medium.com/@mrpowers/validating-spark-dataframe-schemas-28d2b3c69d2a

### More

https://medium.com/@mrpowers/testing-spark-applications-8c590d3215fa

http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/

https://dzone.com/articles/testing-spark-code

https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf

https://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/

https://opencredo.com/spark-testing/

http://eugenezhulenev.com/blog/2014/10/18/run-tests-in-standalone-spark-cluster/


## Data Generator 

[This discussion on StackOverflow](https://stackoverflow.com/questions/591892/tools-for-generating-mock-data) suggests that Databene Benerator is a good choice.

Question: Does any tool support generating data sets that guarantee that joins return results? Ideally, the returned results would also cover different corner cases.
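This does not answer the question about existing tools, but a hand-rolled generator is one way to get such data. The sketch below builds two small tables whose keys are chosen so that an inner join is never empty, and it deliberately adds a few corner cases. All names (`Customer`, `Order`, `JoinableData`) and the choice of corner cases are assumptions made for this illustration.

```scala
import scala.util.Random

// Hypothetical schemas for the two sides of the join.
final case class Customer(id: Int, name: String)
final case class Order(orderId: Int, customerId: Option[Int], amount: Double)

object JoinableData {
  // Generates customers and orders such that an inner join on customer id
  // returns nMatched + 1 rows.
  def generate(nMatched: Int, seed: Long = 42L): (Seq[Customer], Seq[Order]) = {
    require(nMatched >= 1, "need at least one matching key")
    val rnd = new Random(seed)

    // Keys that exist on both sides.
    val matched = (1 to nMatched).map(i => Customer(i, s"customer-$i"))
    val orders = matched.zipWithIndex.map { case (c, i) =>
      Order(i + 1, Some(c.id), rnd.nextInt(10000) / 100.0)
    }

    // Corner cases:
    val lonely = Customer(nMatched + 1, "no-orders")               // left row without a match
    val duplicate = Order(nMatched + 1, Some(matched.head.id), 1.0) // second order for one key
    val orphan = Order(nMatched + 2, Some(nMatched + 100), 2.0)     // right row without a match
    val nullKey = Order(nMatched + 3, None, 3.0)                    // missing join key

    (matched :+ lonely, orders ++ Seq(duplicate, orphan, nullKey))
  }

  // Reference inner join on plain collections, useful for checking results.
  def innerJoin(cs: Seq[Customer], os: Seq[Order]): Seq[(Customer, Order)] =
    for { c <- cs; o <- os if o.customerId.contains(c.id) } yield (c, o)
}
```

The resulting sequences can be converted to DataFrames (for example with `toDF` after importing the session implicits) and the Spark join output compared against `innerJoin`.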



## Good Tools

1. http://www.softwaretestinghelp.com/tools/40-best-database-testing-tools/

1. http://filldb.info/

1. [DataGenerator](https://github.com/FINRAOS/DataGenerator)

    - handles dependencies
    - for Java

1. [Databene Benerator](http://databene.org/databene-benerator)
    - handles dependencies 
    - poor documentation

2. [mocker-data-generator](https://github.com/danibram/mocker-data-generator)
    - JS based

## Others

1. [Online Generate CSV Test Data](http://www.convertcsv.com/generate-test-data.htm)
    - doesn't handle dependencies
    - simple and easy to use

2. [GenerateData](https://www.generatedata.com/)
    - doesn't handle dependencies
    - simple and easy to use
    - able to share it with people

3. [Mockaroo](https://www.mockaroo.com)
    - doesn't handle dependencies
    - simple and easy to use
## Commercial Tools

1. [SQL Data Generator](https://www.red-gate.com/products/sql-development/sql-data-generator/)

    - handles dependencies

2. ApexSQL SQL test data generator

## More

1. https://jethro.io/blog/how-to-generate-mock-data-for-testing

2. http://www.bigsynapse.com/sampling-large-datasets-using-spark

3. http://www.bizdatax.com/wp-content/uploads/2015/10/blog-how-to-mask-subset-and-generate-test-data-Img2.png

4. https://github.com/18F/rdbms-subsetter

5. https://docops.ca.com/ca-test-data-manager/3-5/en

6. http://finraos.github.io/DataGenerator/

7. http://sqlblog.com/blogs/jamie_thomson/archive/2009/09/08/deriving-a-list-of-tables-in-dependency-order.aspx

### Data Sanity Checking

http://databene.org/dbsanity

## Load Testing Tools


[Comparison of Locust and Other Load Testing Tools](https://news.ycombinator.com/item?id=9810274)

[Open Source Load Testing Tool Review](http://blog.loadimpact.com/open-source-load-testing-tool-review)

### Locust 

Locust is a tool/framework for writing code that simulates real user behaviour in a fairly realistic way. For example, it's very common to store state for each simulated user. Once you have written your "user behaviour code", you can simulate many simultaneous users by running it distributed across multiple machines, and hopefully send a realistic load to your system.

According to the author of Locust, ApacheBench or a similar tool is the better fit if you just want to send a lot of requests per second to one or very few URL endpoints.

### [ApacheBench](https://en.wikipedia.org/wiki/ApacheBench)

ApacheBench (ab) is a single-threaded command line computer program 
for measuring the performance of HTTP web servers.
Originally designed to test the Apache HTTP Server, 
it is generic enough to test any web server.

## Other

1. [PipelineAI](http://pipeline.ai/) looks really interesting!



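A local `SparkSession` for tests can be configured as follows.

```scala
import org.apache.spark.sql.SparkSession

// Local session for tests: two worker threads and a single shuffle partition.
val sparkSession: SparkSession = SparkSession.builder()
  .master("local[2]")
  .appName("TestSparkApp")
  .config("spark.sql.shuffle.partitions", "1")
  // Keep the warehouse directory under the JVM's temp directory.
  .config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir"))
  .getOrCreate()
import sparkSession.implicits._
```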

## References

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html