[SPARK] Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3 by kbendick · Pull Request #2595 · apache/iceberg

kbendick · 2021-05-14T20:11:06Z

Right now, we have JMH benchmarks to allow performance comparisons between the native spark sources / sinks vs the iceberg sources and sinks in spark.

However, these JMH test suites don't get registered for Spark 3.

I have moved the JMH test code to spark/src/jmh/... and then added that to the src directories for those projects.

Now, depending on which JDK one is using, you can run either Spark 2 or 3 tests (with JDK 8) or Spark 3 tests (with JDK 11).

It would still be possible to add specific benchmarks to Spark 3 or Spark 2 by placing them in spark[2|3]/src/jmh/....

For now, all of the code is left as is and has been copied over (with the command to run updated in each file). I've also added a .gitkeep file in spark2/benchmark and spark3/benchmark so that users can run the commands as is without having to create the folders.

This closes #2590

…ild.gradle to allow for running on spark3 and spark2

…ate running various JMH suites via copying commands

kbendick · 2021-05-14T20:13:46Z

cc @aokolnychyi @RussellSpitzer @flyrain @karuppayya @raptond @rdblue

holdenk

One quick question, but thanks for working on this :)

jmh.gradle

RussellSpitzer · 2021-06-07T17:34:48Z

@kbendick got this error

Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks

When i ran
g jmh

on the pr

kbendick · 2021-06-09T19:15:42Z

@kbendick got this error

Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks

When i ran
g jmh

on the pr

So just like the current set up (as far as I'm aware), you have to specify the actual test to be run. So something like the following, as mentioned in the existing code here

./gradlew :iceberg-spark2:jmh \ 
   -PjmhIncludeRegex=SparkParquetWritersNestedDataBenchmark \
   -PjmhOutputPath=benchmark/spark-parquet-writers-nested-data-benchmark-result.txt

I will test simply ./gradlew jmh on the current state of master vs my PR, but with my PR, the tests run the same way as mentioned in the doc comments of each test. @RussellSpitzer

kbendick · 2021-06-09T20:35:05Z

@kbendick got this error
Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks
When i ran
g jmh
on the pr
So just like the current set up (as far as I'm aware), you have to specify the actual test to be run. So something like the following, as mentioned in the existing code here
./gradlew :iceberg-spark2:jmh \ 
   -PjmhIncludeRegex=SparkParquetWritersNestedDataBenchmark \
   -PjmhOutputPath=benchmark/spark-parquet-writers-nested-data-benchmark-result.txt
I will test simply ./gradlew jmh on the current state of master vs my PR, but with my PR, the tests run the same way as mentioned in the doc comments of each test. @RussellSpitzer

I tried running ./gradlew jmh both from master and from my branch. Both of them failed eventually, but after running for 20-25 minutes and completing at least 3-4 of the benchmarks each time. @RussellSpitzer has now been able to get plain ./gradle jmh to run for over an hour now on this PR branch.

When I run an individual test, as recommended in the comments of each test (or above), it runs just fine to completion.

I think it's more an issue with gradle JVM settings and then running out of memory or other resources, possibly with the number of forked threads when running all of the benchmarks for all of the requested iterations.

TLDR: I don't think we should run jmh itself to run all tests without better configuring gradle. But stick to running the tests as they're mentioned in the test comments. This does not introduce any new issues with the JMH tests, other than possibly the added number of tests if one runs ./gradlew jmh by itself (which is not guaranteed to run to completion anyway with the current state in master).

kbendick · 2021-06-10T05:06:02Z

@kbendick got this error
Execution failed for task ':iceberg-spark2:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error during execution of benchmarks
When i ran
g jmh
on the pr
So just like the current set up (as far as I'm aware), you have to specify the actual test to be run. So something like the following, as mentioned in the existing code here
./gradlew :iceberg-spark2:jmh \ 
   -PjmhIncludeRegex=SparkParquetWritersNestedDataBenchmark \
   -PjmhOutputPath=benchmark/spark-parquet-writers-nested-data-benchmark-result.txt
I will test simply ./gradlew jmh on the current state of master vs my PR, but with my PR, the tests run the same way as mentioned in the doc comments of each test. @RussellSpitzer
I tried running ./gradlew jmh both from master and from my branch. Both of them failed eventually, but after running for 20-25 minutes and completing at least 3-4 of the benchmarks each time. @RussellSpitzer has now been able to get plain ./gradle jmh to run for over an hour now on this PR branch.

When I run an individual test, as recommended in the comments of each test (or above), it runs just fine to completion.

I think it's more an issue with gradle JVM settings and then running out of memory or other resources, possibly with the number of forked threads when running all of the benchmarks for all of the requested iterations.

TLDR: I don't think we should run jmh itself to run all tests without better configuring gradle. But stick to running the tests as they're mentioned in the test comments. This does not introduce any new issues with the JMH tests, other than possibly the added number of tests if one runs ./gradlew jmh by itself (which is not guaranteed to run to completion anyway with the current state in master).

Further follow up - @RussellSpitzer and I were both able to get the tests to run to completion, but users should not run ~/.gradlew jmh without modifying their gradle parameters quite significantly. It's best to run tests one at a time, as mentioned in the tests themselves.

This patch does not change the behavior of the jmh test suite with respect to ability to run to completion from its previous state. It's still possible either way, just rather hard given the number of tests and the limitations of Gradle's JVM (even with the common expanded memory parameters).

flyrain · 2021-06-14T18:17:14Z

Hi @kbendick , sorry for the delayed review. Both spark2 and spark3 jmh commands works well in my laptop with your PR. For Spark3, I can do with both jdk8 and jdk11. Looks good to me overall. One question, is it possible to let gradle create dirs(spark2/benchmark and spark3/benchmark) instead of using .gitkeep files?

kbendick · 2021-06-16T18:36:02Z

Hi @kbendick , sorry for the delayed review. Both spark2 and spark3 jmh commands works well in my laptop with your PR. For Spark3, I can do with both jdk8 and jdk11. Looks good to me overall. One question, is it possible to let gradle create dirs(spark2/benchmark and spark3/benchmark) instead of using .gitkeep files?

I'm admittedly not great with Gradle. I tried to figure out how to get this to work, but couldn't (though I didn't spend too much time on it and probably could get closer if I spent more time on it later). I could look into it further, but the current state in master is that if a user runs the commands as executed, they get a folder not found error. So I figured this was still an upgrade.

By default, if no output path is specified, the results get placed in a results.txt file under the projects /build/.

So given that this is a user specified input that does deviate from the default, it might make more sense to simply leave as is (where the user gets an error and then creates the folder) as opposed to complicating the build files and the separation of tasks into tasks.gradle and the single required jmh.gradle file (the JMH Gradle plugin framework isn't super flexible).

Though I am partial to the gitignore as that ensures nobody uploads results from their laptop or some env that might be very resource constrained and not representative.

Let me take another pass at it for a bit, but then maybe we can just revert those changes and leave it with the current behavior for the moment? That is, just failing if the user overrides the output file - which does provide a very clear folder not found message.

If we really want to have Gradle create the folders (which I agree is a good idea in theory), would it be possible to simply remove the .gitkeep files and the pre-made folders and then try for that in a follow up PR? Given the age of this PR and its size, I'm somewhat anxious to get it in and then start out any more build file changes in another PR. When I took a shot at it, it was clear it was going to complicate the build file somewhat, so I'd like to separate that discussion from this one which has gone on for a while. @flyrain

kbendick · 2021-06-16T19:29:59Z

I spoke to @flyrain offline and he agrees that due to the age of this PR and the complexity it's going to add to the build file (and the subsequent discussion), we should attempt to change the build file to create the output directories in a subsequent PR (as this discussion has already gotten really huge and the build file changes will almost certainly invite more discussion). This is arguably not worse than the current state, though I'm happy to remove the .gitkeep folders and keep it in the current state (where specifying the output path causes a failure until the user creates the directory the first time).

kbendick added 2 commits May 14, 2021 12:59

Move spark2 JMH tests to spark directory and update jmh.gradle and bu…

95dfbfa

…ild.gradle to allow for running on spark3 and spark2

Add gitkeep file to spark2 and spark3 /benchmark directory to facilit…

6d96fad

…ate running various JMH suites via copying commands

github-actions bot added build INFRA spark labels May 14, 2021

kbendick changed the title ~~Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3~~ [SPARK] Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3 May 14, 2021

Add license headers to the gitkeep files

da758e0

holdenk reviewed Jun 6, 2021

View reviewed changes

jmh.gradle Outdated Show resolved Hide resolved

Make spark 2 and 3 work with Java 8

76c1a5a

RussellSpitzer reviewed Jun 7, 2021

View reviewed changes

jmh.gradle Outdated Show resolved Hide resolved

Initialize jmhProjects to empty array before assigning for readability

3a972a7

flyrain approved these changes Jun 15, 2021

View reviewed changes

RussellSpitzer approved these changes Jun 21, 2021

View reviewed changes

RussellSpitzer merged commit ed0c702 into apache:master Jun 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK] Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3#2595

[SPARK] Allow existing spark2 JMH benchmarks to work with either spark 2 or spark 3#2595
RussellSpitzer merged 5 commits intoapache:masterfrom
kbendick:make-spark-jmh-benchmarks-work-with-spark-3

kbendick commented May 14, 2021 •

edited

Loading

Uh oh!

kbendick commented May 14, 2021

Uh oh!

holdenk left a comment

Uh oh!

Uh oh!

Uh oh!

RussellSpitzer commented Jun 7, 2021

Uh oh!

kbendick commented Jun 9, 2021

Uh oh!

kbendick commented Jun 9, 2021

Uh oh!

kbendick commented Jun 10, 2021

Uh oh!

flyrain commented Jun 14, 2021

Uh oh!

kbendick commented Jun 16, 2021 •

edited

Loading

Uh oh!

kbendick commented Jun 16, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

kbendick commented May 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kbendick commented May 14, 2021

Uh oh!

holdenk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

RussellSpitzer commented Jun 7, 2021

Uh oh!

kbendick commented Jun 9, 2021

Uh oh!

kbendick commented Jun 9, 2021

Uh oh!

kbendick commented Jun 10, 2021

Uh oh!

flyrain commented Jun 14, 2021

Uh oh!

kbendick commented Jun 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kbendick commented Jun 16, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

kbendick commented May 14, 2021 •

edited

Loading

kbendick commented Jun 16, 2021 •

edited

Loading