STORM-2525: Fix flaky integration tests #2127

srdo · 2017-05-20T19:42:36Z

https://issues.apache.org/jira/browse/STORM-2525

EvictionPolicies based on watermarks could throw NPEs if they were asked to decide eviction before first watermark had arrived. This is possible if the WindowingManager decides to compact the window. Made the policies refuse to evict events until the first watermark has arrived.
TopoWrap readLogs was hiding IOExceptions when reading from the logviewer, replacing the real log with a dummy. If this happened, the tests were likely to fail with a misleading error. Made the IOException propagate out to the test instead.
The tests are now checking all the windows produced by the windowing bolt. I couldn't tell what the previous calculation of the expected number of windows was going for. The original was final int numberOfWindows = allBoltLog.size() - windowSec / slideSec;, but it seems pretty reasonable to me to just check all the windows instead.
Made the tests fail if the emit count reported by the built in metrics does not manage to reach the expected minEmits within the time limit.
The windowing integration tests are using the built in metrics to wait for the bolt or spout to emit enough tuples, then using the logviewer to read the worker log. The log entries are then used to determine if the windowing bolt produced the right windows. If the log rolls after the tests have decided enough tuples have been emitted, but before the test manages to read the log, the test is likely to fail, because the log won't account for the emits listed in the archived log file. Increased the rate of tuple emission from the test spouts, but reduced the number of tuples required before the test will start evaluating the log. This is not a complete fix, but producing 100MB of log before we emit 500 tuples (current largest test) seems unlikely.
The metrics sample rate has been set to 1 when using the TopoWrap class. One of the TopoWrap methods offers to get the number of tuples emitted from a component, but the accuracy of the returned number depends on the metrics sample rate. This was causing test flakiness when using TopoWrap.waitForProgress.
Replaced the vagrant image with a newer Ubuntu image. The older image seemed to perform poorly, maybe due to older Virtualbox guest additions? The running time for me went from 40+ minutes to ~15.
Remove the cluster.xml in integration-test/config. It was unused.

I've run the integration tests 5 times with no errors, so it seems stable now.

There is another issue semi-related to this that I'd like some input on. I'll raise another issue if this behavior seems broken to anyone else:

The windowing docs claim that windowing offers at least once processing. This is true for configurations using tuple timestamps (i.e. those ensuring that

storm/storm-client/src/jvm/org/apache/storm/topology/IWindowedBolt.java

Line 50 in 64e29f3

TimestampExtractor getTimestampExtractor();

does not return null), but the count and time based configuration without tuple timestamps may expire tuples before they've been presented to the user's bolt in a window. This is because the WindowManager will attempt to expire tuples if there are more than 100 tuples in the current window (see

storm/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java

Line 50 in a4afacd

public static final int EXPIRE_EVENTS_THRESHOLD = 100;

). The tuples are acked as soon as they're expired

storm/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java

Line 215 in a4afacd

windowLifecycleListener.onExpiry(eventsToExpire);

When tuple timestamps are used, tuples will only expire when the watermark moves, and when the watermark moves, the trigger policy ensures all stored tuples get passed to the user's bolt in an appropriate window. When tuple timestamps, and thus watermarks, are disabled, tuples may get expired during compaction regardless of whether the user's bolt has seen those tuples yet.

I'm not sure if we should just put a note in the windowing docs about this, or if it is better to remove the compaction mechanism? I'm leaning towards removing the compaction mechanism. I don't think it's actually saving much memory since evicted tuples remain stored in a list until the user's bolt is passed the next window (at which point the expiration code would be run again, and the tuples would be expired anyway), and the behavior is not that easy to understand without reading the source. For example you can only get a count window larger than 100 if you use tuple timestamps, since the manager will expire tuples once the window reaches that size if watermarking is disabled.

srdo · 2017-05-20T19:44:14Z

integration-test/run-it.sh

@@ -32,18 +32,19 @@ function list_storm_processes() {

 list_storm_processes || true
 # increasing swap space so we can run lots of workers
-sudo dd if=/dev/zero of=/swapfile.img bs=8192 count=1M
+sudo dd if=/dev/zero of=/swapfile.img bs=4096 count=1M


The ubuntu image has a 10GB disk, so 8GB swap doesn't fit. I didn't see the swap exceed a few hundred megs anyway, so I lowered this a bit.

srdo · 2017-05-20T19:47:31Z

pom.xml

+                <scope>test</scope>
+            </dependency>
+            <dependency>
+                <groupId>org.mockito</groupId>
                <artifactId>mockito-all</artifactId>


I'd like us to transition to using mockito-core instead of this. The all artifact is no longer published, and it can cause issues when hamcrest libraries are also on the classpath.

revans2

+1 I am not an expert on the windowing functions so I would prefer is someone who is also took a look. All the changes made since to me.

revans2 · 2017-05-23T17:56:20Z

Well no one else has looked at it, and it does cut 30 mins off of a build, so I am going to check it in, and if we run into issues in the future we can fix them then.

srdo · 2017-05-23T18:01:01Z

Sounds good, thanks for the review.

srdo commented May 20, 2017

View reviewed changes

srdo force-pushed the STORM-2525 branch 2 times, most recently from bf6fd17 to ebd3288 Compare May 20, 2017 21:46

STORM-2525: Fix flaky integration tests

1db8674

srdo force-pushed the STORM-2525 branch from ebd3288 to 1db8674 Compare May 20, 2017 22:20

revans2 approved these changes May 22, 2017

View reviewed changes

srdo mentioned this pull request May 22, 2017

STORM-2489: Overlap and data loss on WindowedBolt based on Duration #2090

Merged

asfgit merged commit 1db8674 into apache:master May 23, 2017

srdo mentioned this pull request Nov 11, 2017

STORM-2525: Fix flaky integration tests (1.x) #2411

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

STORM-2525: Fix flaky integration tests #2127

STORM-2525: Fix flaky integration tests #2127

srdo commented May 20, 2017

srdo May 20, 2017 •

edited

srdo May 20, 2017

revans2 left a comment

revans2 commented May 23, 2017

srdo commented May 23, 2017

STORM-2525: Fix flaky integration tests #2127

STORM-2525: Fix flaky integration tests #2127

Conversation

srdo commented May 20, 2017

srdo May 20, 2017 • edited

Choose a reason for hiding this comment

srdo May 20, 2017

Choose a reason for hiding this comment

revans2 left a comment

Choose a reason for hiding this comment

revans2 commented May 23, 2017

srdo commented May 23, 2017

srdo May 20, 2017 •

edited