STORM-3120: Clean up leftover null checks in Time, ensure idle thread…#2734
STORM-3120: Clean up leftover null checks in Time, ensure idle thread…#2734asfgit merged 1 commit intoapache:masterfrom
Conversation
| this.tmpDirs.add(nimbusTmp); | ||
| stormHomeBackup = System.getProperty(ConfigUtils.STORM_HOME); | ||
| TmpPath stormHome = new TmpPath(); | ||
| if (!stormHome.getFile().mkdirs()) { |
There was a problem hiding this comment.
I might be going too strict, but this looks like follow-up of STORM-3116 (not relevant to Time) which would be better to have individual PR. The change is minor so no need to file a new issue. Just would like to avoid things getting mixed up in a commit.
Will continue reviewing other change in PR.
There was a problem hiding this comment.
Yes, it is. Do you want me to put this into a STORM-3116 PR, or what do you mean?
There was a problem hiding this comment.
I hope to see this being separated PR, tagging as STORM-3116 or just tagging as MINOR. Either is fine.
…s get to run when cluster time is advanced
| public static void sleep(long ms) throws InterruptedException { | ||
| if (ms > 0) { | ||
| sleepUntil(currentTimeMillis() + ms); | ||
| if (SIMULATING.get()) { |
There was a problem hiding this comment.
Is there any reason to not use sleepUntil? If you think sleepUntil is broken, the method must be fixed since the method is exposed to public. If not, why not using sleepUntil?
There was a problem hiding this comment.
sleepUntil works fine. It's just a little silly in the case where we aren't simulating, because this method would calculate currentTimeMillis() + ms, and sleepUntil would then do the reverse to calculate how many ms to pass to Thread.sleep.
| public static void parkNanos(long nanos) throws InterruptedException { | ||
| if (nanos > 0) { | ||
| sleepUntilNanos(nanoTime() + nanos); | ||
| if (SIMULATING.get()) { |
There was a problem hiding this comment.
Same here, and we may note that we're renaming public method, though Time class is more likely internal class though.
There was a problem hiding this comment.
In this case, I'd like to use LockSupport.parkNanos if we're not simulating, rather than sleeping for the closest number of milliseconds as the method did before.
There was a problem hiding this comment.
Whoops, forgot to respond to the renaming bit: Yes, the method is renamed. If we want to port this to 1.x, we can easily deprecate sleepNanos and provide both methods.
|
Looks good to me overall except two comments above. (I forgot to comment...) |
| } | ||
| } | ||
| LOG.debug("Advanced simulated time to {}", newTime); | ||
| } |
There was a problem hiding this comment.
nit: I don't think this code change is needed, as you fixed the bug with checking for THREAD_SLEEP_TIMES_NANOS == null, so the finally should work. If it does not work, then java itself has some bugs in it.
There was a problem hiding this comment.
That's not why this change is added. I want to make sure that when a test does LocalCluster.advanceClusterTime, all sleeping threads that are supposed to wake up actually get a chance to run before waitForIdle is checked again. The issue is that if we don't remove the thread from the map here, there isn't (as far as I can tell) a guarantee that the sleeping threads actually woke up.
For example, if a test has a loop
while (some condition)
some code
LocalCluster.advanceClusterTime
the first call to advanceClusterTime will wait until all threads are sleeping. If thread 1 is told to sleep for 1ms, it will enter the THREAD_SLEEP_TIMES_NANOS map. On the next iteration, LocalCluster.advanceClusterTime will increment the time, which should cause thread 1 to wake up. If the testing thread is "too fast", it might recheck waitForIdle and catch thread 1 before it manages to wake up and remove itself from THREAD_SLEEP_TIMES_NANOS. This causes LocalCluster.advanceClusterTime to return before the cluster is actually idle, because thread 1 will wake up and start running again after advanceClusterTime returns.
The effect is you can't be sure that all threads are actually idle once LocalCluster.advanceClusterTime returns.
…-3120 STORM-3120: Clean up leftover null checks in Time, ensure idle thread This closes #2734
…s get to run when cluster time is advanced
https://issues.apache.org/jira/browse/STORM-3120
Some of the Time code didn't make sense, e.g. checking for null on the final THREAD_SLEEP_TIMES_NANOS map. It was left over from an earlier refactor of Time. Cleaned it up, and deleted the deprecated Time methods. Also made sure that when simulated time is advanced, idle threads are removed from the THREAD_SLEEP_TIMES_NANOS immediately. When the LocalCluster waits for the topology to be idle, it checks whether all timer threads are in the THREAD_SLEEP_TIMES_NANOS map. It looks to me like there was no guarantee that sleeping threads actually got a chance to run when time was advanced, because they remained in the THREAD_SLEEP_TIMES_NANOS map until they happened to be scheduled. With bad luck, a thread might end up being counted as idle when it just hadn't exited the Time.sleepUntilNanos loop yet.