-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-3250] Migrate Flink and Spark ValidatesRunner to Gradle #4418
Conversation
Assign me when this is ready for review. |
7e8e604
to
ca84534
Compare
Should be ready now. The inline TODOs are questions for you. |
r: @lukecwik |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add top level targets to the root build.gradle which depend on the tasks added here. This way someone can write just ./gradlew :sparkValidatesRunner
and it will run all spark validates runner tests (streaming, batch, validates runner in different subprojects, ...)
sdks/java/core/build.gradle
Outdated
// List of test categories to exclude from this task. Optional. | ||
List<String> excludes | ||
// Pipeline options command line arguments. | ||
// TODO: Can option keys be specified multiple times? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option keys cannot be specified multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
sdks/java/core/build.gradle
Outdated
classpath = config.configuration | ||
useJUnit { | ||
includeCategories 'org.apache.beam.sdk.testing.ValidatesRunner' | ||
if (config.excludes != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null -> false in Groovy
if (config.excludes) {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
sdks/java/core/build.gradle
Outdated
@@ -117,6 +132,113 @@ task packageTests(type: Jar) { | |||
classifier = "tests" | |||
} | |||
|
|||
class ValidatesRunnerConfig { | |||
// Task name prefix | |||
String testPrefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge testPrefix, runnerName and configuration all into one field so that we just standardize how everything is defined like in examples/java/build.gradle for the precommits.
sdks/java/core/build.gradle
Outdated
tasks.create(name: "${config.testPrefix}ValidatesRunner", type: Test) { | ||
group = "Verification" | ||
description = "Validate ${config.runnerName} runner" | ||
def optionsList = config.pipelineOptions.collect { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JsonOutput.toJson? like in examples/java/build.gradle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'll do it.
Map<String, String> systemProperties | ||
} | ||
|
||
def createValidatesRunner(Map m) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the implicit it
parameter which allows you to call this method with zero or more args, see:
https://github.com/apache/beam/blob/master/build_rules.gradle#L164
I like the method but this could be built as a for loop over a set of maps and known runners/configurations like in examples/java. Using the for loop attempts to adhere that as few specializations per runner are added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it harms readability for people unfamiliar with groovy. I also find this more readable than the examples build file because configurations are explicitly linked by ValidatesRunnerConfig
rather than name substitution. If nothing else, this at least makes it less likely that task/variable names collide unintentionally.
I do like the idea of minimizing specializations for runners, but at the end of the day, different runners have different configurations: classpaths, excluded categories, and even forking strategies. Does using a bunch of string maps really help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about putting all configs into a list/map and iterating over that in a loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why the implicit it
is generally better is that you can reflectively either take a map of named parameters or a ValidatesRunnerConfig directly and handle each case internally to the method. The third reason why its beneficial is because you can make a method that takes zero args and then uses the defaults but in this case this benefit is pointless since some args are required.
sdks/java/core/build.gradle
Outdated
def pipelineOptions = "[${optionsList.join(",")}]" | ||
|
||
systemProperty "beamTestPipelineOptions", pipelineOptions | ||
if (config.systemProperties != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null -> false in Groovy
if (config.systemProperties) {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
For reference, is this idiomatic groovy? I can see this causing issues with optional Boolean
fields that we want to effectively default to true
or false
.
sdks/java/core/build.gradle
Outdated
flinkValidatesRunner project(path: ":beam-runners-parent:beam-runners-flink_2.11", configuration: "shadow") | ||
sparkValidatesRunner project(path: project.path, configuration: "shadowTest") | ||
sparkValidatesRunner project(path: ":beam-runners-parent:beam-runners-spark", configuration: "shadow") | ||
// TODO: Is there a way to refer to the `provided` configuration from the Spark build? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, "provided" is just another configuration. You should be able to say project(path: ..., configuration: "provided"). Double check the inheritance structure of the "provided" configuration to know what your pulling in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including the provided dependency does appear to work. I'm unable to find documentation about the exact semantics of the provided dependencies. Given that the tests pass if I add explicit dependencies to spark-core and spark-streaming, should I just leave it as is?
@@ -38,6 +38,14 @@ processResources { | |||
] | |||
} | |||
|
|||
configurations { | |||
flinkValidatesRunner | |||
sparkValidatesRunner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't understand why this wouldn't be runners:spark:ValidatesRunner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The surefire plugin in our maven build allows for a dependenciesToScan
section which searches for test classes within arbitrary dependencies.
Gradle does not support this mechanism and instead requires that either a source file directory or compiled class directory be explicitly provided for scanning. Gradle also does not provide a way to automatically generate such a class directory from dependencies. Instead, you have to:
- Add a hard evaluation dependency on the module that defines the tests
- Reach into that module's build definition to access any jars containing tests you want to run
- Copy over and unzip zip jars into a compiled class directory
- Point the tests at this unzipped class directory
This is cumbersome and error-prone but also makes the build brittle and inflexible because of the module evaluation dependency. I took a stab at it previously, which you can see here. (Let me know if there's a way to persist old branch references because I'll probably delete that branch once this is in.)
Given that inverting the test structure Just Works™ with gradle without introducing module cycles, I think this is the better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These aren't tasks but are configurations so the question doesn't make total sense.
If your asking why aren't these defined within runners/spark/build.gradle
, runners/flink/build.gradle
then it is because Gradle's Java plugin test task type can only run tests from a directory which would require unzipping the sdk core test jar artifact. Ben and I tried this initially but were unable to find a good way to get the artifact without reaching into the gradle project directly like we do with sourcesets.test.output
which we want to avoid.
I also do like having these all defined in one place because then a runner just adds a little configuration to the list and it also makes a common place to maintain them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lukecwik While this should group validates-runner tests nicely in most cases, Spark appears to group a few of its own tests into the maven validates-runner profile. If we insist on translating all maven validates-runner profiles into gradle and keeping them in one place, then we'll have to reach into another package at some point.
I'm for going the gradle way and splitting tests into the packages that define them. As long as those tests in the spark module run somehow, I don't think it matters that they aren't bundled with the rest of the validates-runner tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with a workaround to get this started, but would keep the goal that we could e.g. delete a runner directory entirely without affecting the core SDK build. So textual references are also a form of dependency I'd prefer to avoid. When the dependency graph of tasks doesn't match the dependency graph of modules, you have some real spaghetti potential.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean that there exists a file under sdks/java/core
that references a particular runner. That is undesirable. But we should go ahead with this now and figure out how to get gradle to do it right later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll leave it as is for now. As a longer-term fix, how do you feel about separating out ValidatesRunner into its own gradle module (and directory) and having that depend on each runner? I wouldn't want to make that change before removing the maven build entirely, but I'm trying to figure out if there's a proper way forward from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some additional massaging, I was able to get the tests running from within the Flink directory without manually unzipping jars. It does impose a strict evaluation dependency between modules, but at least it goes in the direction we would prefer. I'll add that as a separate commit on top of this. Let me know if this is satisfactory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a place to look at the change, either a commit on github or another PR that is not for review to show what you tried?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, found it here: #4436
Side note: I know you are focused on particular runners, but for the gradle migration we should make it very easy to add another runner's ValidatesRunner config, so we can get all the ones one master, and preferably the ones that are active on branches (some on branches have stagnated a bit). |
@kennknowles The ease of migration is exactly why I decided to structure it this way. Feel free to experiment or suggest concrete ways we could do this better. As far as @lukecwik and I can tell, this path gives us the least friction and should be easiest to maintain. |
In Gradle, the validates runner tasks live under `sdks/java/core` because Gradle does not support directly running tests from module dependencies.
f9938b5
to
4dd7309
Compare
Map<String, String> systemProperties | ||
} | ||
|
||
def createValidatesRunner(Map m) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why the implicit it
is generally better is that you can reflectively either take a map of named parameters or a ValidatesRunnerConfig directly and handle each case internally to the method. The third reason why its beneficial is because you can make a method that takes zero args and then uses the defaults but in this case this benefit is pointless since some args are required.
Map<String, String> systemProperties | ||
} | ||
|
||
def createValidatesRunner(Map m) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createValidatesRunner
-> createValidatesRunnerTask
Also add a comment as to what the method does, what is the expected input parameter, ...
@@ -117,6 +132,110 @@ task packageTests(type: Jar) { | |||
classifier = "tests" | |||
} | |||
|
|||
class ValidatesRunnerConfig { | |||
// Runner name prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really is the taskNamePrefix
class ValidatesRunnerConfig { | ||
// Runner name prefix | ||
String runner | ||
// List of test categories to exclude from this task. Optional. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add the word optional to the beginning of the comments.
// Pipeline options command line arguments. | ||
Map<String, String> pipelineOptions | ||
// Configuration to use for test runtime classpath. | ||
FileCollection configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a Configuration type instead of FileCollection.
@@ -38,6 +38,14 @@ processResources { | |||
] | |||
} | |||
|
|||
configurations { | |||
flinkValidatesRunner | |||
sparkValidatesRunner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a place to look at the change, either a commit on github or another PR that is not for review to show what you tried?
The other PR seems much nicer, feel free to close this. |
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue.mvn clean verify
to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.Note that Gradle does not support scanning arbitrary jars/dependencies for test classes. For this reason, this change inverts structure of the validates-runner tests, moving them into
sdks/java/core
rather than the runner modules themselves. This does not create a cyclic dependency because dependencies are analyzed by module/configuration rather than by module.On my machine, the Flink validates-runner tests take about 20 minutes to run on maven and about 5 minutes to run with Gradle.