[BEAM-3250] Migrate Flink and Spark ValidatesRunner to Gradle #4418

Closed
bsidhom wants to merge 6 commits from the gradle-validates-runner branch

Conversation


@bsidhom (Contributor) commented Jan 16, 2018

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Note that Gradle does not support scanning arbitrary jars/dependencies for test classes. For this reason, this change inverts the structure of the validates-runner tests, moving them into sdks/java/core rather than the runner modules themselves. This does not create a cyclic dependency because Gradle analyzes dependencies per module configuration rather than per module.

On my machine, the Flink validates-runner tests take about 20 minutes to run with Maven and about 5 minutes with Gradle.
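For illustration, the dependency direction ends up roughly like this, a sketch based on the configuration name and runner project path that appear later in this diff:

```groovy
// sdks/java/core/build.gradle (sketch): a dedicated configuration in the core
// SDK pulls in the Flink runner for the validates-runner task, while the runner
// itself only depends on the core SDK's main artifact. Because Gradle resolves
// dependencies per configuration, this does not form a cycle.
configurations {
  flinkValidatesRunner
}

dependencies {
  flinkValidatesRunner project(
      path: ":beam-runners-parent:beam-runners-flink_2.11",
      configuration: "shadow")
}
```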

@lukecwik self-requested a review on January 17, 2018
@lukecwik (Member)

Assign me when this is ready for review.


bsidhom commented Jan 17, 2018

Should be ready now. The inline TODOs are questions for you.


bsidhom commented Jan 17, 2018

r: @lukecwik

@lukecwik self-assigned this on Jan 17, 2018
@lukecwik (Member) left a comment


Add top-level targets to the root build.gradle that depend on the tasks added here. This way someone can just run ./gradlew :sparkValidatesRunner and it will run all Spark validates-runner tests (streaming, batch, validates-runner tests in different subprojects, ...).
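A minimal sketch of such an aggregate target in the root build.gradle; the subproject path and task name here are illustrative, not taken from this PR:

```groovy
// Root build.gradle (sketch): a top-level task that fans out to the
// per-project validates-runner tasks, so `./gradlew :sparkValidatesRunner`
// runs them all.
task sparkValidatesRunner {
  group = "Verification"
  description = "Runs all Spark ValidatesRunner tests across subprojects"
  dependsOn ":beam-sdks-parent:beam-sdks-java-core:sparkValidatesRunner"
}
```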

// List of test categories to exclude from this task. Optional.
List<String> excludes
// Pipeline options command line arguments.
// TODO: Can option keys be specified multiple times?
Member

Option keys cannot be specified multiple times.

Contributor Author

Removed.

classpath = config.configuration
useJUnit {
includeCategories 'org.apache.beam.sdk.testing.ValidatesRunner'
if (config.excludes != null) {
Member

null -> false in Groovy

if (config.excludes) {

}

Contributor Author

Done.

@@ -117,6 +132,113 @@ task packageTests(type: Jar) {
classifier = "tests"
}

class ValidatesRunnerConfig {
// Task name prefix
String testPrefix
Member

Merge testPrefix, runnerName and configuration all into one field so that we just standardize how everything is defined like in examples/java/build.gradle for the precommits.
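A sketch of that consolidation, assuming a single runner field drives the task name, description, and classpath (field and task names here are illustrative; the later revision of the config class in this PR does introduce a runner field):

```groovy
// Sketch: one "runner" field instead of separate testPrefix/runnerName fields.
class ValidatesRunnerConfig {
  String runner                        // e.g. "flink" or "spark"
  FileCollection configuration        // test runtime classpath
  Map<String, String> pipelineOptions
}

def createValidatesRunner(Map m) {
  def config = new ValidatesRunnerConfig(m)
  tasks.create(name: "${config.runner}ValidatesRunner", type: Test) {
    group = "Verification"
    description = "Validates the ${config.runner} runner"
    classpath = config.configuration
  }
}
```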

tasks.create(name: "${config.testPrefix}ValidatesRunner", type: Test) {
group = "Verification"
description = "Validate ${config.runnerName} runner"
def optionsList = config.pipelineOptions.collect {
Member

JsonOutput.toJson? like in examples/java/build.gradle?

Contributor Author

That'll do it.
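For reference, the JsonOutput approach looks roughly like this; a sketch, assuming the options are gathered into a map on the config object as in the code above:

```groovy
// Sketch (assumes `import groovy.json.JsonOutput` at the top of the build
// script): encode the pipeline options as a JSON array of "--key=value"
// strings for the beamTestPipelineOptions system property.
def optionsList = config.pipelineOptions.collect { k, v -> "--${k}=${v}" }
systemProperty "beamTestPipelineOptions", JsonOutput.toJson(optionsList)
```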

Map<String, String> systemProperties
}

def createValidatesRunner(Map m) {
Member

Use the implicit it parameter which allows you to call this method with zero or more args, see:
https://github.com/apache/beam/blob/master/build_rules.gradle#L164

I like the method, but this could be built as a for loop over a set of maps and known runners/configurations like in examples/java. Using the for loop helps ensure that as few specializations per runner as possible are added.

Contributor Author

This seems like it harms readability for people unfamiliar with Groovy. I also find this more readable than the examples build file because configurations are explicitly linked through ValidatesRunnerConfig rather than by name substitution. If nothing else, this at least makes it less likely that task/variable names collide unintentionally.

I do like the idea of minimizing specializations for runners, but at the end of the day, different runners have different configurations: classpaths, excluded categories, and even forking strategies. Does using a bunch of string maps really help?

Contributor Author

How about putting all configs into a list/map and iterating over that in a loop?
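For example, something along these lines (a sketch; the runner names, configuration references, and pipeline options shown are illustrative):

```groovy
// Sketch: one config map per runner, iterated in a loop, so adding a runner
// means adding one entry rather than a new call site.
def validatesRunnerConfigs = [
  [runner: "flink",
   configuration: configurations.flinkValidatesRunner,
   pipelineOptions: [runner: "TestFlinkRunner"]],
  [runner: "spark",
   configuration: configurations.sparkValidatesRunner,
   pipelineOptions: [runner: "TestSparkRunner"]],
]

validatesRunnerConfigs.each { createValidatesRunner(it) }
```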

Member

The reason why the implicit it is generally better is that you can reflectively take either a map of named parameters or a ValidatesRunnerConfig directly and handle each case internally to the method. Another reason it's beneficial is that you can make a method that takes zero args and then uses the defaults, but in this case that benefit is pointless since some args are required.

def pipelineOptions = "[${optionsList.join(",")}]"

systemProperty "beamTestPipelineOptions", pipelineOptions
if (config.systemProperties != null) {
Member

null -> false in Groovy

if (config.systemProperties) {

}

Contributor Author

Done.

For reference, is this idiomatic Groovy? I can see this causing issues with optional Boolean fields that we want to effectively default to true or false.

flinkValidatesRunner project(path: ":beam-runners-parent:beam-runners-flink_2.11", configuration: "shadow")
sparkValidatesRunner project(path: project.path, configuration: "shadowTest")
sparkValidatesRunner project(path: ":beam-runners-parent:beam-runners-spark", configuration: "shadow")
// TODO: Is there a way to refer to the `provided` configuration from the Spark build?
Member

yes, "provided" is just another configuration. You should be able to say project(path: ..., configuration: "provided"). Double check the inheritance structure of the "provided" configuration to know what your pulling in.

Contributor Author

Including the provided dependency does appear to work. I'm unable to find documentation about the exact semantics of the provided dependencies. Given that the tests pass if I add explicit dependencies to spark-core and spark-streaming, should I just leave it as is?

@@ -38,6 +38,14 @@ processResources {
]
}

configurations {
flinkValidatesRunner
sparkValidatesRunner {
Member

I actually don't understand why this wouldn't be runners:spark:ValidatesRunner

Contributor Author

The surefire plugin in our maven build allows for a dependenciesToScan section which searches for test classes within arbitrary dependencies.

Gradle does not support this mechanism and instead requires that either a source file directory or compiled class directory be explicitly provided for scanning. Gradle also does not provide a way to automatically generate such a class directory from dependencies. Instead, you have to:

  • Add a hard evaluation dependency on the module that defines the tests
  • Reach into that module's build definition to access any jars containing tests you want to run
  • Copy those jars over and unzip them into a compiled class directory
  • Point the tests at this unzipped class directory

This is not only cumbersome and error-prone, it also makes the build brittle and inflexible because of the module evaluation dependency; a rough sketch of what these steps look like follows below. I took a stab at it previously, which you can see here. (Let me know if there's a way to persist old branch references because I'll probably delete that branch once this is in.)

Given that inverting the test structure Just Works™ with gradle without introducing module cycles, I think this is the better way.
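For context, the rejected jar-unzipping approach would look something like the following sketch. The task names, the SDK core project path, and the Gradle 4.x property names (testClassesDirs, testRuntimeClasspath) are assumptions here, and it still imposes the evaluation dependency mentioned above:

```groovy
// Sketch (not what this PR does): unzip the core SDK's test jar into a class
// directory and point a Test task at it from the runner's build.
evaluationDependsOn(":beam-sdks-parent:beam-sdks-java-core")
def coreProject = project(":beam-sdks-parent:beam-sdks-java-core")

task unzipCoreTestClasses(type: Copy) {
  dependsOn coreProject.tasks.packageTests
  from zipTree(coreProject.tasks.packageTests.archivePath)
  into "$buildDir/sdk-core-test-classes"
}

task flinkValidatesRunner(type: Test, dependsOn: unzipCoreTestClasses) {
  testClassesDirs = files("$buildDir/sdk-core-test-classes")
  classpath = configurations.testRuntimeClasspath +
      files("$buildDir/sdk-core-test-classes")
  useJUnit { includeCategories "org.apache.beam.sdk.testing.ValidatesRunner" }
}
```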

Member

These aren't tasks but are configurations so the question doesn't make total sense.

If you're asking why these aren't defined within runners/spark/build.gradle and runners/flink/build.gradle, it is because Gradle's Java plugin test task type can only run tests from a directory, which would require unzipping the SDK core test jar artifact. Ben and I tried this initially but were unable to find a good way to get the artifact without reaching into the Gradle project directly, like we do with sourceSets.test.output, which we want to avoid.

I also do like having these all defined in one place because then a runner just adds a little configuration to the list and it also makes a common place to maintain them all.

Contributor Author

@lukecwik While this should group validates-runner tests nicely in most cases, Spark appears to group a few of its own tests into the maven validates-runner profile. If we insist on translating all maven validates-runner profiles into gradle and keeping them in one place, then we'll have to reach into another package at some point.

I'm for going the gradle way and splitting tests into the packages that define them. As long as those tests in the spark module run somehow, I don't think it matters that they aren't bundled with the rest of the validates-runner tests.

Member

I'm OK with a workaround to get this started, but would keep the goal that we could e.g. delete a runner directory entirely without affecting the core SDK build. So textual references are also a form of dependency I'd prefer to avoid. When the dependency graph of tasks doesn't match the dependency graph of modules, you have some real spaghetti potential.

Member

I mean that there exists a file under sdks/java/core that references a particular runner. That is undesirable. But we should go ahead with this now and figure out how to get gradle to do it right later.

Contributor Author

OK. I'll leave it as is for now. As a longer-term fix, how do you feel about separating out ValidatesRunner into its own gradle module (and directory) and having that depend on each runner? I wouldn't want to make that change before removing the maven build entirely, but I'm trying to figure out if there's a proper way forward from here.

Contributor Author

After some additional massaging, I was able to get the tests running from within the Flink directory without manually unzipping jars. It does impose a strict evaluation dependency between modules, but at least it goes in the direction we would prefer. I'll add that as a separate commit on top of this. Let me know if this is satisfactory.

Member

Do you have a place to look at the change, either a commit on github or another PR that is not for review to show what you tried?

Member

Nevermind, found it here: #4436

@kennknowles (Member)

Side note: I know you are focused on particular runners, but for the Gradle migration we should make it very easy to add another runner's ValidatesRunner config, so we can get all the ones on master, and preferably the ones that are active on branches (some on branches have stagnated a bit).


bsidhom commented Jan 17, 2018

@kennknowles The ease of migration is exactly why I decided to structure it this way. Feel free to experiment or suggest concrete ways we could do this better. As far as @lukecwik and I can tell, this path gives us the least friction and should be easiest to maintain.


Map<String, String> systemProperties
}

def createValidatesRunner(Map m) {
Member

createValidatesRunner -> createValidatesRunnerTask

Also add a comment as to what the method does, what is the expected input parameter, ...
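For instance, roughly the following (a sketch of the suggested rename plus a doc comment; the exact wording is illustrative):

```groovy
/**
 * Creates a Test task named "<runner>ValidatesRunner" that runs the
 * ValidatesRunner-categorized tests against the given runner.
 *
 * Expects a map matching the ValidatesRunnerConfig fields: the runner name,
 * the classpath configuration, the pipeline options, and optionally excluded
 * categories and extra system properties.
 */
def createValidatesRunnerTask(Map m) {
  def config = new ValidatesRunnerConfig(m)
  // ... task creation as in the existing createValidatesRunner body ...
}
```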

@@ -117,6 +132,110 @@ task packageTests(type: Jar) {
classifier = "tests"
}

class ValidatesRunnerConfig {
// Runner name prefix
Member

This really is the taskNamePrefix

class ValidatesRunnerConfig {
// Runner name prefix
String runner
// List of test categories to exclude from this task. Optional.
Member

nit: Add the word optional to the beginning of the comments.

// Pipeline options command line arguments.
Map<String, String> pipelineOptions
// Configuration to use for test runtime classpath.
FileCollection configuration
Member

This should be a Configuration type instead of FileCollection.
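That is, roughly the following sketch; since Gradle's Configuration is itself a FileCollection, the usage sites that assign it to a test classpath should not need to change:

```groovy
class ValidatesRunnerConfig {
  // Configuration supplying the test runtime classpath.
  Configuration configuration
}
```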


@lukecwik (Member)

The other PR seems much nicer, feel free to close this.

@bsidhom closed this on Jan 24, 2018
@bsidhom deleted the gradle-validates-runner branch on February 7, 2018