Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STORM-2525: Fix flaky integration tests #2127

Merged
merged 1 commit into from May 23, 2017
Merged

Conversation

srdo
Copy link
Contributor

@srdo srdo commented May 20, 2017

https://issues.apache.org/jira/browse/STORM-2525

  • EvictionPolicies based on watermarks could throw NPEs if they were asked to decide eviction before first watermark had arrived. This is possible if the WindowingManager decides to compact the window. Made the policies refuse to evict events until the first watermark has arrived.
  • TopoWrap readLogs was hiding IOExceptions when reading from the logviewer, replacing the real log with a dummy. If this happened, the tests were likely to fail with a misleading error. Made the IOException propagate out to the test instead.
  • The tests are now checking all the windows produced by the windowing bolt. I couldn't tell what the previous calculation of the expected number of windows was going for. The original was final int numberOfWindows = allBoltLog.size() - windowSec / slideSec;, but it seems pretty reasonable to me to just check all the windows instead.
  • Made the tests fail if the emit count reported by the built in metrics does not manage to reach the expected minEmits within the time limit.
  • The windowing integration tests are using the built in metrics to wait for the bolt or spout to emit enough tuples, then using the logviewer to read the worker log. The log entries are then used to determine if the windowing bolt produced the right windows. If the log rolls after the tests have decided enough tuples have been emitted, but before the test manages to read the log, the test is likely to fail, because the log won't account for the emits listed in the archived log file. Increased the rate of tuple emission from the test spouts, but reduced the number of tuples required before the test will start evaluating the log. This is not a complete fix, but producing 100MB of log before we emit 500 tuples (current largest test) seems unlikely.
  • The metrics sample rate has been set to 1 when using the TopoWrap class. One of the TopoWrap methods offers to get the number of tuples emitted from a component, but the accuracy of the returned number depends on the metrics sample rate. This was causing test flakiness when using TopoWrap.waitForProgress.
  • Replaced the vagrant image with a newer Ubuntu image. The older image seemed to perform poorly, maybe due to older Virtualbox guest additions? The running time for me went from 40+ minutes to ~15.
  • Remove the cluster.xml in integration-test/config. It was unused.

I've run the integration tests 5 times with no errors, so it seems stable now.

There is another issue semi-related to this that I'd like some input on. I'll raise another issue if this behavior seems broken to anyone else:

The windowing docs claim that windowing offers at least once processing. This is true for configurations using tuple timestamps (i.e. those ensuring that

TimestampExtractor getTimestampExtractor();
does not return null), but the count and time based configuration without tuple timestamps may expire tuples before they've been presented to the user's bolt in a window. This is because the WindowManager will attempt to expire tuples if there are more than 100 tuples in the current window (see
public static final int EXPIRE_EVENTS_THRESHOLD = 100;
). The tuples are acked as soon as they're expired
windowLifecycleListener.onExpiry(eventsToExpire);

When tuple timestamps are used, tuples will only expire when the watermark moves, and when the watermark moves, the trigger policy ensures all stored tuples get passed to the user's bolt in an appropriate window. When tuple timestamps, and thus watermarks, are disabled, tuples may get expired during compaction regardless of whether the user's bolt has seen those tuples yet.

I'm not sure if we should just put a note in the windowing docs about this, or if it is better to remove the compaction mechanism? I'm leaning towards removing the compaction mechanism. I don't think it's actually saving much memory since evicted tuples remain stored in a list until the user's bolt is passed the next window (at which point the expiration code would be run again, and the tuples would be expired anyway), and the behavior is not that easy to understand without reading the source. For example you can only get a count window larger than 100 if you use tuple timestamps, since the manager will expire tuples once the window reaches that size if watermarking is disabled.

@@ -32,18 +32,19 @@ function list_storm_processes() {

list_storm_processes || true
# increasing swap space so we can run lots of workers
sudo dd if=/dev/zero of=/swapfile.img bs=8192 count=1M
sudo dd if=/dev/zero of=/swapfile.img bs=4096 count=1M
Copy link
Contributor Author

@srdo srdo May 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ubuntu image has a 10GB disk, so 8GB swap doesn't fit. I didn't see the swap exceed a few hundred megs anyway, so I lowered this a bit.

<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like us to transition to using mockito-core instead of this. The all artifact is no longer published, and it can cause issues when hamcrest libraries are also on the classpath.

@srdo srdo force-pushed the STORM-2525 branch 2 times, most recently from bf6fd17 to ebd3288 Compare May 20, 2017 21:46
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I am not an expert on the windowing functions so I would prefer is someone who is also took a look. All the changes made since to me.

@revans2
Copy link
Contributor

revans2 commented May 23, 2017

Well no one else has looked at it, and it does cut 30 mins off of a build, so I am going to check it in, and if we run into issues in the future we can fix them then.

@srdo
Copy link
Contributor Author

srdo commented May 23, 2017

Sounds good, thanks for the review.

@asfgit asfgit merged commit 1db8674 into apache:master May 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants