This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-1004: Travis CI - Job Exceeded Maximum Time Limit #624

Closed
wants to merge 43 commits into from

Conversation

justinleet
Contributor

@justinleet justinleet commented Jun 23, 2017

Contributor Comments

This PR cleans up several issues we've seen on Travis and speeds up the build.

  • This should avoid the skadoo by just outright stopping problematic threads.
  • Allows for the reuse of some of the InMemoryComponents, to avoid a lot of the spinup time in tests
  • Improvements to a variety of tests to make them faster
  • Refactoring of some parser integration tests to be less integration-y and more unit-test-ish, specifically because not every parser needs a full integration test and they take a while.
  • Moving to VM in Travis
  • More details as provided in comments

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution, we ask that you follow these guidelines and double-check the following:

For all changes:

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root incubating-metron folder via:

    mvn -q clean integration-test install && build_utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and or integration tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, run the following commands and then verify the changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@cestella
Member

If you succeed here, you'll be my best friend. We're drowning in integration test drama over here. :)

@justinleet
Contributor Author

I moved work from a mostly unrelated branch over here. Specifically, it attempts to get the InMemoryComponents to be reused across a given test class. This is particularly helpful for ProfilerIntegrationTest and MaasIntegrationTest. It's not extended to everything (in particular, classes with only a single test). It does make the build run faster, but not necessarily fast enough to solve this problem. There's also some cleanup that needs to be taken care of.

Skadoo appears to be mostly resolved on this branch for some reason. It also performs an actual thread.stop(), because it needs to actually kill the threads.

@ottobackwards
Contributor

This looks really good. After trying to find a better 'use' or 'sequencing' to resolve some problems - and failing - I am really glad to see this.

I do have a couple of things I have tried that I think are worth considering along with this.

@justinleet justinleet closed this Jun 26, 2017
@justinleet justinleet reopened this Jun 26, 2017
@cestella
Member

This does look good. A couple of observations in no particular order of importance; just wanted to get this out there for discussion.

Considering the overhead

I want to consider, for a moment, the overhead that is not in our tests. In the last run, I count the following timings:

  • build - 5:41
  • unit tests - 2:59
  • integration tests - 14:44
  • metron-config - 2:17
  • verify licenses - 0:16

That's 25:57 out of a total run from Travis of 31:53, which leaves 5:56 of overhead.

We should factor that in.

Where to Focus

Build Time

The natural conclusion is to focus on the long pole, those integration tests, but we may be well served to also consider the build time. Our build takes a long time, and we depend upon parallelization to make it return in a sensible time (the user time for the build is 26 minutes!). Furthermore, our build is extremely IO-heavy due to the shading that we (necessarily) do. While we are on a system shared with the rest of the Apache projects, I think reducing the IO burden of our build is worthwhile.

While I think that shading is important, we have a very ham-fisted way of doing it. We shade for two reasons:

  • Relocation of dependencies that conflict
  • Creating uber jars for Storm

One issue is that, if we consider the tree of projects induced by their dependency relationships, we shade non-leaf projects for the purpose of relocation. I propose we stop doing that. Take, for instance, metron-common. We shade that project to relocate guava and beanutils. The consequence of relocating 2 packages is 47M of dependencies. Those 47M of dependencies also get bundled again into all of the leaf projects (e.g. metron-parsers, etc.), so we end up shading twice.

I propose fixing this one of two ways:

  • Aggressively exclude ALL dependencies other than org.apache.metron and the relocated dependencies in any project that needs shading purely for relocation
  • Extract the shaded/relocated dependencies across the project into a separate project and make all of our non-leaf dependencies non-shaded

I think the first may be the easiest to achieve and most surgical.
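As a rough sketch of the first option, the maven-shade-plugin configuration might look something like the following. The coordinates and relocation targets here are illustrative, not taken from Metron's actual POMs:

```xml
<!-- Illustrative sketch only: whitelist the project's own artifacts plus the
     two dependencies being relocated, and exclude everything else from the
     shaded jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <include>org.apache.metron:*</include>
        <include>com.google.guava:guava</include>
        <include>commons-beanutils:commons-beanutils</include>
      </includes>
    </artifactSet>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.metron.guava</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.apache.commons.beanutils</pattern>
        <shadedPattern>org.apache.metron.beanutils</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With a whitelist like this, the shaded non-leaf jar would carry only the project's own classes plus the relocated packages, rather than tens of megabytes of transitive dependencies.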

Ultimately, it may even be advantageous to have a single jar created as the deployable output of our process (or maybe a small handful representing the independent subcomponents: metron, MaaS and stellar).

Integration Tests

Obviously the integration tests are the long pole in the tent. A couple of thoughts on these:

TaxiiIntegrationTest

My impression was that it was slow because parsing taxii via the mitre library was downright arduous. It costs us ~2:30 as of the working build above.

We are passing a relatively large blob of taxii in and should consider trimming the taxii example data down to something more manageable and see if that will drop the timing down.

PcapIntegrationTest

We currently test two modes for the PcapIntegrationTest, pulling the timestamp from the key and pulling the timestamp from the message itself. We know that in production, we only want to support pulling the timestamp from the key. We might cut this test time in half by only testing the supported approach (it's 81 seconds as of last count).

Parser Integration Tests

We might want to reconsider what we integration test here. We currently have an integration test for every parser and we may get the same coverage by mocking out the ParserWriterBolt and constructing a shim to pass data in, execute against the real parser bolt, capture data written and evaluate the output. This would drop the overhead for each parser test dramatically (no storm or kafka) and would keep the semantics of the tests. Admittedly this may not be a focus in terms of bang-for-buck because total parser cost is around 86 seconds.
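To sketch the idea (with entirely hypothetical names; Metron's real parser and writer interfaces differ), the shim boils down to a capturing writer plus direct invocation of the parser, with no Storm or Kafka in the loop:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; not Metron's actual API.
interface Parser {
    List<String> parse(byte[] rawMessage);
}

// A trivial CSV "parser" standing in for a real parser implementation.
class CsvParser implements Parser {
    public List<String> parse(byte[] rawMessage) {
        return Arrays.asList(new String(rawMessage).trim().split(","));
    }
}

// Captures whatever would have been written, instead of going through
// a writer bolt; the test asserts directly on the captured output.
class CapturingWriter {
    final List<String> written = new ArrayList<>();

    void write(String message) {
        written.add(message);
    }
}

public class ParserShimExample {
    public static void main(String[] args) {
        Parser parser = new CsvParser();
        CapturingWriter writer = new CapturingWriter();
        // Pass raw data in, capture what gets written, evaluate the output.
        for (String field : parser.parse("a,b,c".getBytes())) {
            writer.write(field);
        }
        System.out.println(writer.written);
    }
}
```

The semantics of the test stay the same (raw bytes in, written records out), but the per-parser overhead drops to plain JVM object construction.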

Reuse Integration Test Infrastructure

This seems to be the persistent conversation whenever our tests start to push us over the edge. We incur quite a bit of overhead because we spin up and down integration test infrastructure in our InMemoryComponents. We could consider correcting this in a couple of ways:

  • Reusing the infrastructure
    • Either use the in-memory components or spin up lightweight versions of the infrastructure and then run the tests against that (e.g. Docker or separate-process versions of the in-memory components).
    • We'd need to refactor each integration test to clean up after itself so other tests are not affected.
  • Parallelizing the Integration Tests
    • Have the InMemoryComponents be able to run in parallel
    • This would require refactoring the components to seek out open ports and use them.
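Seeking out an open port is cheap to do by binding to port 0 and letting the OS pick an ephemeral port; a minimal sketch (the class name is invented):

```java
import java.io.IOException;
import java.net.ServerSocket;

public class FreePortFinder {
    // Bind to port 0 so the OS assigns a free ephemeral port, then release it.
    // There's a small race window before the component rebinds the port, so
    // in practice the caller should retry on bind failure.
    public static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(findFreePort());
    }
}
```

Each parallel InMemoryComponent instance could pick its own port this way instead of assuming a fixed one.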

These are just my thoughts that I wanted to get out there and capture for review and comment.

@cestella
Member

I submitted PRs against this branch to incorporate the suggested changes above for:

  • Selective shading for non-leaf projects to cut the build times dramatically.
  • TaxiiIntegrationTest
  • PcapIntegrationTest

I submit them without credit.

@justinleet
Contributor Author

As a note, what I have is currently the first steps towards reusing infra. It's not perfect, and it's not reused across classes.

There was an attempt to use the build matrix to split fast and slow tests, but it resulted in inconsistent failures; it seems like Maven gets tangled up between the builds. This could merit further investigation. It'll increase total processing time (because both the unit and integration jobs have to actually build), but it should avoid having either portion of the build time out.
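For reference, the kind of build-matrix split described above might look roughly like this in .travis.yml. This is a hypothetical sketch, including the invented TEST_SUITE variable, not the configuration attempted here:

```yaml
# Hypothetical sketch: each matrix entry runs as a separate Travis job,
# which is why each job has to build before running its slice of the tests.
env:
  - TEST_SUITE=unit
  - TEST_SUITE=integration
script:
  - |
    mvn -q clean install -DskipTests
    if [ "$TEST_SUITE" = "unit" ]; then
      mvn -q surefire:test
    else
      mvn -q integration-test
    fi
```

The duplicated build step is the processing-time cost mentioned above; the benefit is that neither job alone should hit the Travis time limit.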

@@ -90,23 +90,6 @@ public boolean accept(File dir, String name) {
}

Contributor


I think until this is completely deprecated, we should keep the test, but disable it.

Contributor Author


Test is added back in, but with an @Ignore and a comment. Let me know if you want anything else there.

.travis.yml Outdated
@@ -17,7 +17,7 @@ before_install:
- export PATH=$M2_HOME/bin:$PATH
script:
- |
Contributor

@ottobackwards ottobackwards Jun 27, 2017


because it makes it slow right?
Can we document with the commits, as you go, the rationale behind the changes, so we can look back and understand a little bit?

"why did we get rid of FOO?"
Let me check the commit log

" Remove foo. It is seen to cause an increase of X in Y and do z. it is also pretty snarky and fresh"

"Oh, that makes sense"

Contributor Author


Sorry, my bad. Usually I tend to consider the commits less important because it's usually a full feature, and it's just minor changes / fixes afterwards.

I'll try to make sure the messages are easier to follow, since this is pretty ongoing until it's consistent.

@justinleet
Contributor Author

justinleet commented Jun 27, 2017

As @ottobackwards points out, there should be some more explanation for a recent commit. 6562246

Specifically it turns off the jacoco:prepare-agent from the Travis build and allows us to just run the tests directly without it. We don't actually produce the reports here, so it's pretty extraneous.

Locally this resulted in:

mvn -q -T 2C surefire:test@unit-tests  309.46s user 20.20s system 169% cpu 3:14.92 total
mvn -q -T 2C jacoco:prepare-agent surefire:test@unit-tests  555.20s user 25.93s system 254% cpu 3:47.90 total

@justinleet
Contributor Author

I added a fix to actually clear out the correct directory of Maven artifacts before caching. In a separate, experimental branch, there's an attempt to cache the artifacts resulting from npm. See: https://github.com/justinleet/metron/tree/caching and https://travis-ci.org/justinleet/metron. This required a run with the integration tests turned off in order to populate the cache successfully, then re-enabling them in the next commit.
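For context, Maven artifact caching in Travis generally follows the sketch below; the before_cache step is where the project's own artifacts would be cleared so stale builds are not reused. The paths are illustrative, not taken from this PR's .travis.yml:

```yaml
# Illustrative sketch: cache the local Maven repository across builds,
# but evict the project's own artifacts first so each build resolves
# fresh Metron jars while still reusing third-party dependencies.
cache:
  directories:
    - $HOME/.m2
before_cache:
  - rm -rf $HOME/.m2/repository/org/apache/metron
```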

At this point we do have intermittent successful builds on my Travis, although I doubt it's consistent.

@justinleet
Contributor Author

Most recent commits attempt to fix what appears to be a preexisting intermittent test issue regarding Kafka.

@@ -61,6 +62,7 @@
private static final int KAFKA_RETRY = 10;
@Autowired
private KafkaComponent kafkaWithZKComponent;
private ComponentRunner runner;
Member


Can we make this and setup() static and make the setup/teardown be @BeforeClass/@AfterClass?

Contributor Author


I don't think so. setup() uses kafkaWithZKComponent, which is @Autowired. I don't think that can be static, so there's probably more refactoring involved. I can look at it, if you want.

Contributor


I don't think this is that critical to getting us back to a working travis build. Just my opinion.

Member

@cestella cestella Jun 30, 2017


I looked at it a bit this morning and @justinleet is right, it's not easily done. I'm not super happy about spinning up and down components per test-case considering the bulk of this PR has been around removing exactly that. That being said, the test is ~20s, so I'm not going to lose any sleep over it on that regard.

I think the part that I don't like most is the mixture of semantics between Spring's wiring of components and JUnit's setting up of components that we can't mix and match. It leaves a muddle, frankly:

  • If someone needs more heavy-weight components in other tests in this project, they will be forced into spinning up and down infrastructure per test-case, which has proven costly and was a prominent thing that we targeted in this PR for removal.
  • Our approach leaves us doing component reuse in a way that is neither standard based on the other components nor standard based on spring, which is confusing and will inevitably lead to inefficient and confusing tests.

There is evidently a fix from Spring here, and I'm creating METRON-1009 to capture reverting this unit test and taking up the fix provided by Spring, or a port of that fix to our components.

@merrimanr is right, it's not a requirement to get this perfect as long as we've removed the intermittent nature of the test failure (and I feel we have). It's not so bad that I am -1, but it's ugly, confusing and it will lead to errors in the future IMO.

Contributor Author


I'm going to add a comment specifically pointing to that ticket, and noting that this test is an outlier in terms of how we actually test.

@cestella Is there anything else in particular that you'd like to see in order to at least mitigate risk of propagating test issues like this until we upgrade spring?

Member


No, I think we've done enough. We honestly need a test guidance document similar to the parser testing document that I created as part of this ticket. It's become evident from this exercise that we can provide some over-arching guidance here to limit the probability of us arriving in this situation again. But that's a follow-on, IMO.

Contributor


I agree with everything that's been said

try {
runner.start();
} catch (UnableToStartException e) {
e.printStackTrace();
Member


Also, we should throw a runtime exception here, probably.

Contributor Author


Why not just kill the try/catch entirely? If it fails, it might as well blow up the tests anyways

Member


I'd be ok with that. This will just mask an error until it fails later less explicably.

Contributor Author


I forgot this was in a run(). I changed it to a runtime exception, like you originally suggested.
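The resulting pattern, sketched here with a hypothetical UnableToStartException standing in for the real one: a run() method can't declare checked exceptions, so the catch block wraps and rethrows instead of printing the stack trace and continuing, which would mask the failure.

```java
// Hypothetical checked exception standing in for UnableToStartException.
class UnableToStartException extends Exception {
    UnableToStartException(String message) {
        super(message);
    }
}

public class RunnerExample {
    // Stand-in for runner.start(), which may fail with a checked exception.
    static void start(boolean shouldFail) throws UnableToStartException {
        if (shouldFail) {
            throw new UnableToStartException("could not start component");
        }
    }

    // run() cannot declare checked exceptions, so wrap and rethrow rather
    // than swallow: the test then fails at the point of the real problem.
    public static void run(boolean shouldFail) {
        try {
            start(shouldFail);
        } catch (UnableToStartException e) {
            throw new RuntimeException("Unable to start runner", e);
        }
    }

    public static void main(String[] args) {
        run(false);
        System.out.println("started");
    }
}
```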

@justinleet justinleet closed this Jun 29, 2017
@justinleet justinleet reopened this Jun 29, 2017
@justinleet justinleet closed this Jun 29, 2017
@justinleet justinleet reopened this Jun 29, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet
Contributor Author

NC

@justinleet justinleet reopened this Jun 30, 2017
@justinleet
Contributor Author

Sigh, accidentally submitted that while kicking travis

@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@cestella
Member

+1 by inspection.

@justinleet justinleet changed the title [DO NOT MERGE] METRON-1004: Travis CI - Job Exceeded Maximum Time Limit METRON-1004: Travis CI - Job Exceeded Maximum Time Limit Jun 30, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@ottobackwards
Contributor

I am +1 on these changes.

@justinleet
Contributor Author

I'm gonna do one more to hit a round ten successes, then merge assuming no issues. Shouldn't be too long.

@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@merrimanr
Contributor

+1. Really nice job guys. Looking forward to working builds again!
