Skip to content

[SPARK] Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3#2595

Merged
RussellSpitzer merged 5 commits intoapache:masterfrom
kbendick:make-spark-jmh-benchmarks-work-with-spark-3
Jun 21, 2021
Merged

[SPARK] Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3#2595
RussellSpitzer merged 5 commits intoapache:masterfrom
kbendick:make-spark-jmh-benchmarks-work-with-spark-3

Conversation

@kbendick
Copy link
Contributor

@kbendick kbendick commented May 14, 2021

Right now, we have JMH benchmarks to allow performance comparisons between the native spark sources / sinks vs the iceberg sources and sinks in spark.

However, these JMH test suites don't get registered for Spark 3.

I have moved the JMH test code to spark/src/jmh/... and then added that to the src directories for those projects.

Now, depending on which JDK one is using, you can run either Spark 2 or 3 tests (with JDK 8) or Spark 3 tests (with JDK 11).

It would still be possible to add specific benchmarks to Spark 3 or Spark 2 by placing them in spark[2|3]/src/jmh/....

For now, all of the code is left as is and has been copied over (with the command to run updated in each file). I've also added a .gitkeep file in spark2/benchmark and spark3/benchmark so that users can run the commands as is without having to create the folders.

This closes #2590

kbendick added 2 commits May 14, 2021 12:59
…ild.gradle to allow for running on spark3 and spark2
…ate running various JMH suites via copying commands
@kbendick
Copy link
Contributor Author

@kbendick kbendick changed the title Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3 [SPARK] Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3 May 14, 2021
Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick question, but thanks for working on this :)

@RussellSpitzer
Copy link
Member

@kbendick got this error

Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks

When i ran
g jmh

on the pr

@kbendick
Copy link
Contributor Author

kbendick commented Jun 9, 2021

@kbendick got this error

Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks

When i ran
g jmh

on the pr

So just like the current set up (as far as I'm aware), you have to specify the actual test to be run. So something like the following, as mentioned in the existing code here

./gradlew :iceberg-spark2:jmh \ 
   -PjmhIncludeRegex=SparkParquetWritersNestedDataBenchmark \
   -PjmhOutputPath=benchmark/spark-parquet-writers-nested-data-benchmark-result.txt

I will test simply ./gradlew jmh on the current state of master vs my PR, but with my PR, the tests run the same way as mentioned in the doc comments of each test. @RussellSpitzer

@kbendick
Copy link
Contributor Author

kbendick commented Jun 9, 2021

@kbendick got this error

Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks

When i ran
g jmh
on the pr

So just like the current set up (as far as I'm aware), you have to specify the actual test to be run. So something like the following, as mentioned in the existing code here

./gradlew :iceberg-spark2:jmh \ 
   -PjmhIncludeRegex=SparkParquetWritersNestedDataBenchmark \
   -PjmhOutputPath=benchmark/spark-parquet-writers-nested-data-benchmark-result.txt

I will test simply ./gradlew jmh on the current state of master vs my PR, but with my PR, the tests run the same way as mentioned in the doc comments of each test. @RussellSpitzer

I tried running ./gradlew jmh both from master and from my branch. Both of them failed eventually, but after running for 20-25 minutes and completing at least 3-4 of the benchmarks each time. @RussellSpitzer has now been able to get plain ./gradle jmh to run for over an hour now on this PR branch.

When I run an individual test, as recommended in the comments of each test (or above), it runs just fine to completion.

I think it's more an issue with gradle JVM settings and then running out of memory or other resources, possibly with the number of forked threads when running all of the benchmarks for all of the requested iterations.

TLDR: I don't think we should run jmh itself to run all tests without better configuring gradle. But stick to running the tests as they're mentioned in the test comments. This does not introduce any new issues with the JMH tests, other than possibly the added number of tests if one runs ./gradlew jmh by itself (which is not guaranteed to run to completion anyway with the current state in master).

@kbendick
Copy link
Contributor Author

@kbendick got this error

Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks

When i ran
g jmh
on the pr

So just like the current set up (as far as I'm aware), you have to specify the actual test to be run. So something like the following, as mentioned in the existing code here

./gradlew :iceberg-spark2:jmh \ 
   -PjmhIncludeRegex=SparkParquetWritersNestedDataBenchmark \
   -PjmhOutputPath=benchmark/spark-parquet-writers-nested-data-benchmark-result.txt

I will test simply ./gradlew jmh on the current state of master vs my PR, but with my PR, the tests run the same way as mentioned in the doc comments of each test. @RussellSpitzer

I tried running ./gradlew jmh both from master and from my branch. Both of them failed eventually, but after running for 20-25 minutes and completing at least 3-4 of the benchmarks each time. @RussellSpitzer has now been able to get plain ./gradle jmh to run for over an hour now on this PR branch.

When I run an individual test, as recommended in the comments of each test (or above), it runs just fine to completion.

I think it's more an issue with gradle JVM settings and then running out of memory or other resources, possibly with the number of forked threads when running all of the benchmarks for all of the requested iterations.

TLDR: I don't think we should run jmh itself to run all tests without better configuring gradle. But stick to running the tests as they're mentioned in the test comments. This does not introduce any new issues with the JMH tests, other than possibly the added number of tests if one runs ./gradlew jmh by itself (which is not guaranteed to run to completion anyway with the current state in master).

Further follow up - @RussellSpitzer and I were both able to get the tests to run to completion, but users should not run ~/.gradlew jmh without modifying their gradle parameters quite significantly. It's best to run tests one at a time, as mentioned in the tests themselves.

This patch does not change the behavior of the jmh test suite with respect to ability to run to completion from its previous state. It's still possible either way, just rather hard given the number of tests and the limitations of Gradle's JVM (even with the common expanded memory parameters).

@flyrain
Copy link
Contributor

flyrain commented Jun 14, 2021

Hi @kbendick , sorry for the delayed review. Both spark2 and spark3 jmh commands works well in my laptop with your PR. For Spark3, I can do with both jdk8 and jdk11. Looks good to me overall. One question, is it possible to let gradle create dirs(spark2/benchmark and spark3/benchmark) instead of using .gitkeep files?

@kbendick
Copy link
Contributor Author

kbendick commented Jun 16, 2021

Hi @kbendick , sorry for the delayed review. Both spark2 and spark3 jmh commands works well in my laptop with your PR. For Spark3, I can do with both jdk8 and jdk11. Looks good to me overall. One question, is it possible to let gradle create dirs(spark2/benchmark and spark3/benchmark) instead of using .gitkeep files?

I'm admittedly not great with Gradle. I tried to figure out how to get this to work, but couldn't (though I didn't spend too much time on it and probably could get closer if I spent more time on it later). I could look into it further, but the current state in master is that if a user runs the commands as executed, they get a folder not found error. So I figured this was still an upgrade.

By default, if no output path is specified, the results get placed in a results.txt file under the projects /build/.

So given that this is a user specified input that does deviate from the default, it might make more sense to simply leave as is (where the user gets an error and then creates the folder) as opposed to complicating the build files and the separation of tasks into tasks.gradle and the single required jmh.gradle file (the JMH Gradle plugin framework isn't super flexible).

Though I am partial to the gitignore as that ensures nobody uploads results from their laptop or some env that might be very resource constrained and not representative.

Let me take another pass at it for a bit, but then maybe we can just revert those changes and leave it with the current behavior for the moment? That is, just failing if the user overrides the output file - which does provide a very clear folder not found message.

If we really want to have Gradle create the folders (which I agree is a good idea in theory), would it be possible to simply remove the .gitkeep files and the pre-made folders and then try for that in a follow up PR? Given the age of this PR and its size, I'm somewhat anxious to get it in and then start out any more build file changes in another PR. When I took a shot at it, it was clear it was going to complicate the build file somewhat, so I'd like to separate that discussion from this one which has gone on for a while. @flyrain

@kbendick
Copy link
Contributor Author

I spoke to @flyrain offline and he agrees that due to the age of this PR and the complexity it's going to add to the build file (and the subsequent discussion), we should attempt to change the build file to create the output directories in a subsequent PR (as this discussion has already gotten really huge and the build file changes will almost certainly invite more discussion). This is arguably not worse than the current state, though I'm happy to remove the .gitkeep folders and keep it in the current state (where specifying the output path causes a failure until the user creates the directory the first time).

@RussellSpitzer RussellSpitzer merged commit ed0c702 into apache:master Jun 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add JMH tests for spark 3 to compare Iceberg source and sink with native Spark source and sink

4 participants

Comments