[BEAM-3250] Migrate Flink and Spark ValidatesRunner to Gradle #4418

Closed
bsidhom wants to merge 6 commits from the gradle-validates-runner branch

Conversation


@bsidhom (Contributor) commented Jan 16, 2018

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Note that Gradle does not support scanning arbitrary jars/dependencies for test classes. For this reason, this change inverts the structure of the validates-runner tests, moving them into sdks/java/core rather than the runner modules themselves. This does not create a cyclic dependency because Gradle analyzes dependencies per module configuration rather than per module.

On my machine, the Flink validates-runner tests take about 20 minutes to run with Maven and about 5 minutes with Gradle.
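For illustration, the dependency direction ends up roughly like this, a sketch based on the configuration name and runner project path that appear later in this diff:

```groovy
// sdks/java/core/build.gradle (sketch): a dedicated configuration in the core
// SDK pulls in the Flink runner for the validates-runner task, while the runner
// itself only depends on the core SDK's main artifact. Because Gradle resolves
// dependencies per configuration, this does not form a cycle.
configurations {
  flinkValidatesRunner
}

dependencies {
  flinkValidatesRunner project(
      path: ":beam-runners-parent:beam-runners-flink_2.11",
      configuration: "shadow")
}
```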

@lukecwik self-requested a review on January 17, 2018
@lukecwik (Member)

Assign me when this is ready for review.


bsidhom commented Jan 17, 2018

Should be ready now. The inline TODOs are questions for you.


bsidhom commented Jan 17, 2018

r: @lukecwik

@lukecwik self-assigned this on Jan 17, 2018
@lukecwik (Member) left a comment


Add top-level targets to the root build.gradle that depend on the tasks added here. This way someone can just run ./gradlew :sparkValidatesRunner and it will run all Spark validates-runner tests (streaming, batch, validates-runner tests in different subprojects, ...).
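A minimal sketch of such an aggregate target in the root build.gradle; the subproject path and task name here are illustrative, not taken from this PR:

```groovy
// Root build.gradle (sketch): a top-level task that fans out to the
// per-project validates-runner tasks, so `./gradlew :sparkValidatesRunner`
// runs them all.
task sparkValidatesRunner {
  group = "Verification"
  description = "Runs all Spark ValidatesRunner tests across subprojects"
  dependsOn ":beam-sdks-parent:beam-sdks-java-core:sparkValidatesRunner"
}
```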

// List of test categories to exclude from this task. Optional.
List<String> excludes
// Pipeline options command line arguments.
// TODO: Can option keys be specified multiple times?
Member

Option keys cannot be specified multiple times.

Contributor Author

Removed.

classpath = config.configuration
useJUnit {
includeCategories 'org.apache.beam.sdk.testing.ValidatesRunner'
if (config.excludes != null) {
Member

null -> false in Groovy

if (config.excludes) {

}

Contributor Author

Done.

@@ -117,6 +132,113 @@ task packageTests(type: Jar) {
classifier = "tests"
}

class ValidatesRunnerConfig {
// Task name prefix
String testPrefix
Member

Merge testPrefix, runnerName and configuration all into one field so that we just standardize how everything is defined like in examples/java/build.gradle for the precommits.
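A sketch of that consolidation, assuming a single runner field drives the task name, description, and classpath (field and task names here are illustrative; the later revision of the config class in this PR does introduce a runner field):

```groovy
// Sketch: one "runner" field instead of separate testPrefix/runnerName fields.
class ValidatesRunnerConfig {
  String runner                        // e.g. "flink" or "spark"
  FileCollection configuration        // test runtime classpath
  Map<String, String> pipelineOptions
}

def createValidatesRunner(Map m) {
  def config = new ValidatesRunnerConfig(m)
  tasks.create(name: "${config.runner}ValidatesRunner", type: Test) {
    group = "Verification"
    description = "Validates the ${config.runner} runner"
    classpath = config.configuration
  }
}
```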

tasks.create(name: "${config.testPrefix}ValidatesRunner", type: Test) {
group = "Verification"
description = "Validate ${config.runnerName} runner"
def optionsList = config.pipelineOptions.collect {
Member

JsonOutput.toJson? like in examples/java/build.gradle?

Contributor Author

That'll do it.
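For reference, the JsonOutput approach looks roughly like this; a sketch, assuming the options are gathered into a map on the config object as in the code above:

```groovy
// Sketch (assumes `import groovy.json.JsonOutput` at the top of the build
// script): encode the pipeline options as a JSON array of "--key=value"
// strings for the beamTestPipelineOptions system property.
def optionsList = config.pipelineOptions.collect { k, v -> "--${k}=${v}" }
systemProperty "beamTestPipelineOptions", JsonOutput.toJson(optionsList)
```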

Map<String, String> systemProperties
}

def createValidatesRunner(Map m) {
Member

Use the implicit it parameter which allows you to call this method with zero or more args, see:
https://github.com/apache/beam/blob/master/build_rules.gradle#L164

I like the method, but this could be built as a for loop over a set of maps and known runners/configurations like in examples/java. Using the for loop helps ensure that as few specializations per runner as possible are added.

Contributor Author

This seems like it harms readability for people unfamiliar with Groovy. I also find this more readable than the examples build file because configurations are explicitly linked through ValidatesRunnerConfig rather than by name substitution. If nothing else, this at least makes it less likely that task/variable names collide unintentionally.

I do like the idea of minimizing specializations for runners, but at the end of the day, different runners have different configurations: classpaths, excluded categories, and even forking strategies. Does using a bunch of string maps really help?

Contributor Author

How about putting all configs into a list/map and iterating over that in a loop?
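For example, something along these lines (a sketch; the runner names, configuration references, and pipeline options shown are illustrative):

```groovy
// Sketch: one config map per runner, iterated in a loop, so adding a runner
// means adding one entry rather than a new call site.
def validatesRunnerConfigs = [
  [runner: "flink",
   configuration: configurations.flinkValidatesRunner,
   pipelineOptions: [runner: "TestFlinkRunner"]],
  [runner: "spark",
   configuration: configurations.sparkValidatesRunner,
   pipelineOptions: [runner: "TestSparkRunner"]],
]

validatesRunnerConfigs.each { createValidatesRunner(it) }
```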

Member

The reason why the implicit it is generally better is that you can reflectively take either a map of named parameters or a ValidatesRunnerConfig directly and handle each case internally to the method. Another reason it's beneficial is that you can make a method that takes zero args and then uses the defaults, but in this case that benefit is pointless since some args are required.

def pipelineOptions = "[${optionsList.join(",")}]"

systemProperty "beamTestPipelineOptions", pipelineOptions
if (config.systemProperties != null) {
Member

null -> false in Groovy

if (config.systemProperties) {

}

Contributor Author

Done.

For reference, is this idiomatic Groovy? I can see this causing issues with optional Boolean fields that we want to effectively default to true or false.

flinkValidatesRunner project(path: ":beam-runners-parent:beam-runners-flink_2.11", configuration: "shadow")
sparkValidatesRunner project(path: project.path, configuration: "shadowTest")
sparkValidatesRunner project(path: ":beam-runners-parent:beam-runners-spark", configuration: "shadow")
// TODO: Is there a way to refer to the `provided` configuration from the Spark build?
Member

yes, "provided" is just another configuration. You should be able to say project(path: ..., configuration: "provided"). Double check the inheritance structure of the "provided" configuration to know what your pulling in.

Contributor Author

Including the provided dependency does appear to work. I'm unable to find documentation about the exact semantics of the provided dependencies. Given that the tests pass if I add explicit dependencies to spark-core and spark-streaming, should I just leave it as is?

@@ -38,6 +38,14 @@ processResources {
]
}

configurations {
flinkValidatesRunner
sparkValidatesRunner {
Member

I actually don't understand why this wouldn't be runners:spark:ValidatesRunner

Contributor Author

The surefire plugin in our maven build allows for a dependenciesToScan section which searches for test classes within arbitrary dependencies.

Gradle does not support this mechanism and instead requires that either a source file directory or compiled class directory be explicitly provided for scanning. Gradle also does not provide a way to automatically generate such a class directory from dependencies. Instead, you have to:

  • Add a hard evaluation dependency on the module that defines the tests
  • Reach into that module's build definition to access any jars containing tests you want to run
  • Copy those jars over and unzip them into a compiled class directory
  • Point the tests at this unzipped class directory

This is not only cumbersome and error-prone, it also makes the build brittle and inflexible because of the module evaluation dependency; a rough sketch of what these steps look like follows below. I took a stab at it previously, which you can see here. (Let me know if there's a way to persist old branch references because I'll probably delete that branch once this is in.)

Given that inverting the test structure Just Works™ with gradle without introducing module cycles, I think this is the better way.
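For context, the rejected jar-unzipping approach would look something like the following sketch. The task names, the SDK core project path, and the Gradle 4.x property names (testClassesDirs, testRuntimeClasspath) are assumptions here, and it still imposes the evaluation dependency mentioned above:

```groovy
// Sketch (not what this PR does): unzip the core SDK's test jar into a class
// directory and point a Test task at it from the runner's build.
evaluationDependsOn(":beam-sdks-parent:beam-sdks-java-core")
def coreProject = project(":beam-sdks-parent:beam-sdks-java-core")

task unzipCoreTestClasses(type: Copy) {
  dependsOn coreProject.tasks.packageTests
  from zipTree(coreProject.tasks.packageTests.archivePath)
  into "$buildDir/sdk-core-test-classes"
}

task flinkValidatesRunner(type: Test, dependsOn: unzipCoreTestClasses) {
  testClassesDirs = files("$buildDir/sdk-core-test-classes")
  classpath = configurations.testRuntimeClasspath +
      files("$buildDir/sdk-core-test-classes")
  useJUnit { includeCategories "org.apache.beam.sdk.testing.ValidatesRunner" }
}
```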

Member

These aren't tasks but are configurations so the question doesn't make total sense.

If you're asking why these aren't defined within runners/spark/build.gradle and runners/flink/build.gradle, it is because Gradle's Java plugin test task type can only run tests from a directory, which would require unzipping the SDK core test jar artifact. Ben and I tried this initially but were unable to find a good way to get the artifact without reaching into the Gradle project directly, like we do with sourceSets.test.output, which we want to avoid.

I also do like having these all defined in one place because then a runner just adds a little configuration to the list and it also makes a common place to maintain them all.

Contributor Author

@lukecwik While this should group validates-runner tests nicely in most cases, Spark appears to group a few of its own tests into the maven validates-runner profile. If we insist on translating all maven validates-runner profiles into gradle and keeping them in one place, then we'll have to reach into another package at some point.

I'm for going the gradle way and splitting tests into the packages that define them. As long as those tests in the spark module run somehow, I don't think it matters that they aren't bundled with the rest of the validates-runner tests.

Member

I'm OK with a workaround to get this started, but would keep the goal that we could e.g. delete a runner directory entirely without affecting the core SDK build. So textual references are also a form of dependency I'd prefer to avoid. When the dependency graph of tasks doesn't match the dependency graph of modules, you have some real spaghetti potential.

Member

I mean that there exists a file under sdks/java/core that references a particular runner. That is undesirable. But we should go ahead with this now and figure out how to get gradle to do it right later.

Contributor Author

OK. I'll leave it as is for now. As a longer-term fix, how do you feel about separating out ValidatesRunner into its own gradle module (and directory) and having that depend on each runner? I wouldn't want to make that change before removing the maven build entirely, but I'm trying to figure out if there's a proper way forward from here.

Contributor Author

After some additional massaging, I was able to get the tests running from within the Flink directory without manually unzipping jars. It does impose a strict evaluation dependency between modules, but at least it goes in the direction we would prefer. I'll add that as a separate commit on top of this. Let me know if this is satisfactory.

Member

Do you have a place to look at the change, either a commit on github or another PR that is not for review to show what you tried?

Member

Nevermind, found it here: #4436

@kennknowles (Member)

Side note: I know you are focused on particular runners, but for the Gradle migration we should make it very easy to add another runner's ValidatesRunner config, so we can get all the ones on master, and preferably the ones that are active on branches (some on branches have stagnated a bit).


bsidhom commented Jan 17, 2018

@kennknowles The ease of migration is exactly why I decided to structure it this way. Feel free to experiment or suggest concrete ways we could do this better. As far as @lukecwik and I can tell, this path gives us the least friction and should be easiest to maintain.


Map<String, String> systemProperties
}

def createValidatesRunner(Map m) {
Member

createValidatesRunner -> createValidatesRunnerTask

Also add a comment as to what the method does, what is the expected input parameter, ...
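For instance, roughly the following (a sketch of the suggested rename plus a doc comment; the exact wording is illustrative):

```groovy
/**
 * Creates a Test task named "<runner>ValidatesRunner" that runs the
 * ValidatesRunner-categorized tests against the given runner.
 *
 * Expects a map matching the ValidatesRunnerConfig fields: the runner name,
 * the classpath configuration, the pipeline options, and optionally excluded
 * categories and extra system properties.
 */
def createValidatesRunnerTask(Map m) {
  def config = new ValidatesRunnerConfig(m)
  // ... task creation as in the existing createValidatesRunner body ...
}
```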

@@ -117,6 +132,110 @@ task packageTests(type: Jar) {
classifier = "tests"
}

class ValidatesRunnerConfig {
// Runner name prefix
Member

This really is the taskNamePrefix

class ValidatesRunnerConfig {
// Runner name prefix
String runner
// List of test categories to exclude from this task. Optional.
Member

nit: Add the word optional to the beginning of the comments.

// Pipeline options command line arguments.
Map<String, String> pipelineOptions
// Configuration to use for test runtime classpath.
FileCollection configuration
Member

This should be a Configuration type instead of FileCollection.
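That is, roughly the following sketch; since Gradle's Configuration is itself a FileCollection, the usage sites that assign it to a test classpath should not need to change:

```groovy
class ValidatesRunnerConfig {
  // Configuration supplying the test runtime classpath.
  Configuration configuration
}
```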


@lukecwik (Member)

The other PR seems much nicer, feel free to close this.

@bsidhom closed this on Jan 24, 2018
@bsidhom deleted the gradle-validates-runner branch on February 7, 2018