This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-1004: Travis CI - Job Exceeded Maximum Time Limit #624

Closed
wants to merge 43 commits into from

Conversation

justinleet
Contributor

@justinleet justinleet commented Jun 23, 2017

Contributor Comments

This PR cleans up several issues we've seen on Travis and speeds up the build.

  • This should avoid the skadoo by just outright stopping problematic threads.
  • Allows for the reuse of some of the InMemoryComponents, to avoid a lot of the spinup time in tests
  • Improvements to a variety of tests to make them faster
  • Refactoring of some parser integration tests to be less integration-y and more unit-test-ish, specifically because not every parser needs a full integration test and they take a while.
  • Moving to VM in Travis
  • More details as provided in comments

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution, we ask that you follow these guidelines and double-check the following:

For all changes:

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root incubating-metron folder via:

    mvn -q clean integration-test install && build_utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and or integration tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that the format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not, run the following commands and then verify the changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@cestella
Member

If you succeed here, you'll be my best friend. We're drowning in integration test drama over here. :)

@justinleet
Contributor Author

I moved work from a mostly unrelated branch over here. Specifically, it attempts to get the InMemoryComponents to be reused across a given test class. This is particularly helpful for ProfilerIntegrationTest and MaasIntegrationTest. It's not extended to everything (in particular, classes with only a single test). It does make the build run faster, but not necessarily fast enough to solve this problem. There's also some cleanup that needs to be taken care of.

Skadoo appears to be mostly resolved on this branch for some reason. It also performs an actual thread.stop(), because it needs to actually kill the threads.

@ottobackwards
Contributor

This looks really good. After trying to find a better 'use' or 'sequencing' to resolve some problems - and failing - I am really glad to see this.

I do have a couple of things I have tried that I think are worth considering along with this.

@justinleet justinleet closed this Jun 26, 2017
@justinleet justinleet reopened this Jun 26, 2017
@cestella
Member

This does look good. A couple of observations in no particular order of importance; just wanted to get this out there for discussion.

Considering the overhead

I want to consider, for a moment, the overhead that is not in our tests. In the last run, I count the following timings:

  • build - 5:41
  • unit tests - 2:59
  • integration tests - 14:44
  • metron-config - 2:17
  • verify licenses - 0:16

That's 25:57 out of a total run from Travis of 31:53, which leaves 5:56 of overhead.

We should factor that in.

Where to Focus

Build Time

The natural conclusion is to focus on the long pole, those integration tests, but we may be well served to also consider the build time. Our build takes a long time, and we depend upon parallelization to make it return in a sensible time (the user time for the build is 26 minutes!). Furthermore, our build is extremely IO-heavy due to the shading that we (necessarily) do. While we are on a system shared with the rest of the Apache projects, I think reducing the IO burden of our build is worthwhile.

While I think that shading is important, we have a very ham-fisted way of doing it. We shade for two reasons:

  • Relocation of dependencies that conflict
  • Creating uber jars for Storm

One issue is that, if we consider the tree of projects induced by their dependency relationships, we shade non-leaf projects for the purpose of relocation. I propose we stop doing that. Take, for instance, metron-common. We shade that project to relocate guava and beanutils. The consequence of relocating 2 packages is 47M of dependencies. Those 47M of dependencies also get bundled again into all of the leaf projects (e.g. metron-parsers, etc.), so we end up shading twice.

I propose fixing this one of two ways:

  • Aggressively exclude ALL dependencies other than org.apache.metron and the relocated dependencies in any project that needs shading purely for relocation
  • Extract the shaded/relocated dependencies across the project into a separate project and make all of our non-leaf dependencies non-shaded

I think the first may be the easiest to achieve and most surgical.
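As a rough sketch of the first option, the maven-shade-plugin configuration might look something like the following. The coordinates and relocation targets here are illustrative, not taken from Metron's actual POMs:

```xml
<!-- Illustrative sketch only: whitelist the project's own artifacts plus the
     two dependencies being relocated, and exclude everything else from the
     shaded jar. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <include>org.apache.metron:*</include>
        <include>com.google.guava:guava</include>
        <include>commons-beanutils:commons-beanutils</include>
      </includes>
    </artifactSet>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.metron.guava</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.apache.commons.beanutils</pattern>
        <shadedPattern>org.apache.metron.beanutils</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With a whitelist like this, the shaded non-leaf jar would carry only the project's own classes plus the relocated packages, rather than tens of megabytes of transitive dependencies.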

Ultimately, it may even be advantageous to have a single jar created as the deployable output of our process (or maybe a small handful representing the independent subcomponents: metron, MaaS and stellar).

Integration Tests

Obviously the integration tests are the long pole in the tent. A couple of thoughts on these:

TaxiiIntegrationTest

My impression was that it was slow because parsing taxii via the mitre library was downright arduous. It costs us ~2:30 as of the working build above.

We are passing a relatively large blob of taxii in and should consider trimming the taxii example data down to something more manageable and see if that will drop the timing down.

PcapIntegrationTest

We currently test two modes for the PcapIntegrationTest, pulling the timestamp from the key and pulling the timestamp from the message itself. We know that in production, we only want to support pulling the timestamp from the key. We might cut this test time in half by only testing the supported approach (it's 81 seconds as of last count).

Parser Integration Tests

We might want to reconsider what we integration test here. We currently have an integration test for every parser and we may get the same coverage by mocking out the ParserWriterBolt and constructing a shim to pass data in, execute against the real parser bolt, capture data written and evaluate the output. This would drop the overhead for each parser test dramatically (no storm or kafka) and would keep the semantics of the tests. Admittedly this may not be a focus in terms of bang-for-buck because total parser cost is around 86 seconds.
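To sketch the idea (with entirely hypothetical names; Metron's real parser and writer interfaces differ), the shim boils down to a capturing writer plus direct invocation of the parser, with no Storm or Kafka in the loop:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; not Metron's actual API.
interface Parser {
    List<String> parse(byte[] rawMessage);
}

// A trivial CSV "parser" standing in for a real parser implementation.
class CsvParser implements Parser {
    public List<String> parse(byte[] rawMessage) {
        return Arrays.asList(new String(rawMessage).trim().split(","));
    }
}

// Captures whatever would have been written, instead of going through
// a writer bolt; the test asserts directly on the captured output.
class CapturingWriter {
    final List<String> written = new ArrayList<>();

    void write(String message) {
        written.add(message);
    }
}

public class ParserShimExample {
    public static void main(String[] args) {
        Parser parser = new CsvParser();
        CapturingWriter writer = new CapturingWriter();
        // Pass raw data in, capture what gets written, evaluate the output.
        for (String field : parser.parse("a,b,c".getBytes())) {
            writer.write(field);
        }
        System.out.println(writer.written);
    }
}
```

The semantics of the test stay the same (raw bytes in, written records out), but the per-parser overhead drops to plain JVM object construction.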

Reuse Integration Test Infrastructure

This seems to be the persistent conversation whenever our tests start to push us over the edge. We incur quite a bit of overhead because we spin up and down integration test infrastructure in our InMemoryComponents. We could consider correcting this in a couple of ways:

  • Reusing the infrastructure
    • Either use the in-memory components or spin up lightweight versions of the infrastructure and then run the tests against that (e.g. Docker or separate-process versions of the in-memory components).
    • We'd need to refactor each integration test to clean up after itself so other tests are not affected.
  • Parallelizing the Integration Tests
    • Have the InMemoryComponents be able to run in parallel
    • This would require refactoring the components to seek out open ports and use them.
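Seeking out an open port is cheap to do by binding to port 0 and letting the OS pick an ephemeral port; a minimal sketch (the class name is invented):

```java
import java.io.IOException;
import java.net.ServerSocket;

public class FreePortFinder {
    // Bind to port 0 so the OS assigns a free ephemeral port, then release it.
    // There's a small race window before the component rebinds the port, so
    // in practice the caller should retry on bind failure.
    public static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(findFreePort());
    }
}
```

Each parallel InMemoryComponent instance could pick its own port this way instead of assuming a fixed one.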

These are just my thoughts that I wanted to get out there and capture for review and comment.

@cestella
Member

I submitted PRs against this branch to incorporate the suggested changes above for:

  • Selective shading for non-leaf projects to cut the build times dramatically.
  • TaxiiIntegrationTest
  • PcapIntegrationTest

I submit them without credit.

@justinleet
Contributor Author

As a note, what I have is currently the first steps towards reusing infra. It's not perfect, and it's not reused across classes.

There was an attempt to use the build matrix to split fast and slow tests, but it resulted in inconsistent failures; it seems like Maven gets tangled up between the builds. This could merit further investigation. It'll increase total processing time (because both the unit and integration jobs have to actually build), but it should avoid having either portion of the build time out.
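For reference, the kind of build-matrix split described above might look roughly like this in .travis.yml. This is a hypothetical sketch, including the invented TEST_SUITE variable, not the configuration attempted here:

```yaml
# Hypothetical sketch: each matrix entry runs as a separate Travis job,
# which is why each job has to build before running its slice of the tests.
env:
  - TEST_SUITE=unit
  - TEST_SUITE=integration
script:
  - |
    mvn -q clean install -DskipTests
    if [ "$TEST_SUITE" = "unit" ]; then
      mvn -q surefire:test
    else
      mvn -q integration-test
    fi
```

The duplicated build step is the processing-time cost mentioned above; the benefit is that neither job alone should hit the Travis time limit.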

@@ -90,23 +90,6 @@ public boolean accept(File dir, String name) {
}

Contributor


I think until this is completely deprecated, we should keep the test, but disable it.

Contributor Author


Test is added back in, but with an @Ignore and a comment. Let me know if you want anything else there.

.travis.yml Outdated
@@ -17,7 +17,7 @@ before_install:
- export PATH=$M2_HOME/bin:$PATH
script:
- |
Contributor

@ottobackwards ottobackwards Jun 27, 2017


because it makes it slow right?
Can we document with the commits, as you go, the rationale behind the changes, so we can look back and understand a little bit?

"why did we get rid of FOO?"
Let me check the commit log

" Remove foo. It is seen to cause an increase of X in Y and do z. it is also pretty snarky and fresh"

"Oh, that makes sense"

Contributor Author


Sorry, my bad. Usually I tend to consider the commits less important because it's usually a full feature, and it's just minor changes / fixes afterwards.

I'll try to make sure the messages are easier to follow, since this is pretty ongoing until it's consistent.

@justinleet
Contributor Author

justinleet commented Jun 27, 2017

As @ottobackwards points out, there should be some more explanation for a recent commit. 6562246

Specifically it turns off the jacoco:prepare-agent from the Travis build and allows us to just run the tests directly without it. We don't actually produce the reports here, so it's pretty extraneous.

Locally this resulted in:

mvn -q -T 2C surefire:test@unit-tests  309.46s user 20.20s system 169% cpu 3:14.92 total
mvn -q -T 2C jacoco:prepare-agent surefire:test@unit-tests  555.20s user 25.93s system 254% cpu 3:47.90 total

@justinleet
Contributor Author

I added a fix to actually clear out the correct directory of Maven artifacts before caching. In a separate, experimental branch, there's an attempt to cache the artifacts resulting from npm. See: https://github.com/justinleet/metron/tree/caching and https://travis-ci.org/justinleet/metron. This required a run with the integration tests turned off in order to populate the cache successfully, then re-enabling them in the next commit.
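For context, Maven artifact caching in Travis generally follows the sketch below; the before_cache step is where the project's own artifacts would be cleared so stale builds are not reused. The paths are illustrative, not taken from this PR's .travis.yml:

```yaml
# Illustrative sketch: cache the local Maven repository across builds,
# but evict the project's own artifacts first so each build resolves
# fresh Metron jars while still reusing third-party dependencies.
cache:
  directories:
    - $HOME/.m2
before_cache:
  - rm -rf $HOME/.m2/repository/org/apache/metron
```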

At this point we do have intermittent successful builds on my Travis, although I doubt it's consistent.

@justinleet
Contributor Author

Most recent commits attempt to fix what appears to be a preexisting intermittent test issue regarding Kafka.

@@ -61,6 +62,7 @@
private static final int KAFKA_RETRY = 10;
@Autowired
private KafkaComponent kafkaWithZKComponent;
private ComponentRunner runner;
Member


Can we make this and setup() static and make the setup/teardown be @BeforeClass/@AfterClass?

Contributor Author


I don't think so. setup() uses kafkaWithZKComponent, which is @Autowired. I don't think that can be static, so there's probably more refactoring involved. I can look at it, if you want.

Contributor


I don't think this is that critical to getting us back to a working travis build. Just my opinion.

Member

@cestella cestella Jun 30, 2017


I looked at it a bit this morning and @justinleet is right, it's not easily done. I'm not super happy about spinning up and down components per test-case considering the bulk of this PR has been around removing exactly that. That being said, the test is ~20s, so I'm not going to lose any sleep over it on that regard.

I think the part that I don't like most is the mixture of semantics between Spring's wiring of components and JUnit's setting up of components that we can't mix and match. It leaves a muddle, frankly:

  • If someone needs more heavy-weight components in other tests in this project, they will be forced into spinning up and down infrastructure per test-case, which has proven costly and was a prominent thing that we targeted in this PR for removal.
  • Our approach leaves us doing component reuse in a way that is neither standard based on the other components nor standard based on spring, which is confusing and will inevitably lead to inefficient and confusing tests.

There is evidently a fix from Spring here, and I'm creating METRON-1009 to capture reverting this unit test and taking up the fix provided by Spring, or a port of that fix to our components.

@merrimanr is right, it's not a requirement to get this perfect as long as we've removed the intermittent nature of the test failure (and I feel we have). It's not so bad that I am -1, but it's ugly, confusing and it will lead to errors in the future IMO.

Contributor Author


I'm going to add a comment specifically pointing to that ticket, and noting that this test is an outlier in terms of how we actually test.

@cestella Is there anything else in particular that you'd like to see in order to at least mitigate risk of propagating test issues like this until we upgrade spring?

Member


No, I think we've done enough. We honestly need a test guidance document similar to the parser testing document that I created as part of this ticket. It's become evident from this exercise that we can provide some over-arching guidance here to limit the probability of us arriving in this situation again. But that's a follow-on, IMO.

Contributor


I agree with everything that's been said

try {
runner.start();
} catch (UnableToStartException e) {
e.printStackTrace();
Member


Also, we should throw a runtime exception here, probably.

Contributor Author


Why not just kill the try/catch entirely? If it fails, it might as well blow up the tests anyways

Member


I'd be ok with that. This will just mask an error until it fails later less explicably.

Contributor Author


I forgot this was in a run(). I changed it to a runtime exception, like you originally suggested.
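The resulting pattern, sketched here with a hypothetical UnableToStartException standing in for the real one: a run() method can't declare checked exceptions, so the catch block wraps and rethrows instead of printing the stack trace and continuing, which would mask the failure.

```java
// Hypothetical checked exception standing in for UnableToStartException.
class UnableToStartException extends Exception {
    UnableToStartException(String message) {
        super(message);
    }
}

public class RunnerExample {
    // Stand-in for runner.start(), which may fail with a checked exception.
    static void start(boolean shouldFail) throws UnableToStartException {
        if (shouldFail) {
            throw new UnableToStartException("could not start component");
        }
    }

    // run() cannot declare checked exceptions, so wrap and rethrow rather
    // than swallow: the test then fails at the point of the real problem.
    public static void run(boolean shouldFail) {
        try {
            start(shouldFail);
        } catch (UnableToStartException e) {
            throw new RuntimeException("Unable to start runner", e);
        }
    }

    public static void main(String[] args) {
        run(false);
        System.out.println("started");
    }
}
```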

@justinleet justinleet closed this Jun 29, 2017
@justinleet justinleet reopened this Jun 29, 2017
@justinleet justinleet closed this Jun 29, 2017
@justinleet justinleet reopened this Jun 29, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet
Contributor Author

NC

@justinleet justinleet reopened this Jun 30, 2017
@justinleet
Contributor Author

Sigh, accidentally submitted that while kicking travis

@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@cestella
Member

+1 by inspection.

@justinleet justinleet changed the title [DO NOT MERGE] METRON-1004: Travis CI - Job Exceeded Maximum Time Limit METRON-1004: Travis CI - Job Exceeded Maximum Time Limit Jun 30, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@ottobackwards
Contributor

I am +1 on these changes.

@justinleet
Contributor Author

I'm gonna do one more to hit a round ten successes, then merge assuming no issues. Shouldn't be too long.

@justinleet justinleet closed this Jun 30, 2017
@justinleet justinleet reopened this Jun 30, 2017
@merrimanr
Contributor

+1. Really nice job guys. Looking forward to working builds again!
