Skip to content

STORM-3120: Clean up leftover null checks in Time, ensure idle thread…#2734

Merged
asfgit merged 1 commit intoapache:masterfrom
srdo:STORM-3120
Jun 27, 2018
Merged

STORM-3120: Clean up leftover null checks in Time, ensure idle thread…#2734
asfgit merged 1 commit intoapache:masterfrom
srdo:STORM-3120

Conversation

@srdo
Copy link
Contributor

@srdo srdo commented Jun 24, 2018

…s get to run when cluster time is advanced

https://issues.apache.org/jira/browse/STORM-3120

Some of the Time code didn't make sense, e.g. checking for null on the final THREAD_SLEEP_TIMES_NANOS map. It was left over from an earlier refactor of Time. Cleaned it up, and deleted the deprecated Time methods. Also made sure that when simulated time is advanced, idle threads are removed from the THREAD_SLEEP_TIMES_NANOS immediately. When the LocalCluster waits for the topology to be idle, it checks whether all timer threads are in the THREAD_SLEEP_TIMES_NANOS map. It looks to me like there was no guarantee that sleeping threads actually got a chance to run when time was advanced, because they remained in the THREAD_SLEEP_TIMES_NANOS map until they happened to be scheduled. With bad luck, a thread might end up being counted as idle when it just hadn't exited the Time.sleepUntilNanos loop yet.

this.tmpDirs.add(nimbusTmp);
stormHomeBackup = System.getProperty(ConfigUtils.STORM_HOME);
TmpPath stormHome = new TmpPath();
if (!stormHome.getFile().mkdirs()) {
Copy link
Contributor

@HeartSaVioR HeartSaVioR Jun 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be going too strict, but this looks like follow-up of STORM-3116 (not relevant to Time) which would be better to have individual PR. The change is minor so no need to file a new issue. Just would like to avoid things getting mixed up in a commit.

Will continue reviewing other change in PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is. Do you want me to put this into a STORM-3116 PR, or what do you mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to see this being separated PR, tagging as STORM-3116 or just tagging as MINOR. Either is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to #2736

public static void sleep(long ms) throws InterruptedException {
if (ms > 0) {
sleepUntil(currentTimeMillis() + ms);
if (SIMULATING.get()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to not use sleepUntil? If you think sleepUntil is broken, the method must be fixed since the method is exposed to public. If not, why not using sleepUntil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sleepUntil works fine. It's just a little silly in the case where we aren't simulating, because this method would calculate currentTimeMillis() + ms, and sleepUntil would then do the reverse to calculate how many ms to pass to Thread.sleep.

public static void parkNanos(long nanos) throws InterruptedException {
if (nanos > 0) {
sleepUntilNanos(nanoTime() + nanos);
if (SIMULATING.get()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, and we may note that we're renaming public method, though Time class is more likely internal class though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I'd like to use LockSupport.parkNanos if we're not simulating, rather than sleeping for the closest number of milliseconds as the method did before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, forgot to respond to the renaming bit: Yes, the method is renamed. If we want to port this to 1.x, we can easily deprecate sleepNanos and provide both methods.

@HeartSaVioR
Copy link
Contributor

Looks good to me overall except two comments above. (I forgot to comment...)

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 just a single nit.

}
}
LOG.debug("Advanced simulated time to {}", newTime);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think this code change is needed, as you fixed the bug with checking for THREAD_SLEEP_TIMES_NANOS == null, so the finally should work. If it does not work, then java itself has some bugs in it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not why this change is added. I want to make sure that when a test does LocalCluster.advanceClusterTime, all sleeping threads that are supposed to wake up actually get a chance to run before waitForIdle is checked again. The issue is that if we don't remove the thread from the map here, there isn't (as far as I can tell) a guarantee that the sleeping threads actually woke up.

For example, if a test has a loop

while (some condition)
  some code 
  LocalCluster.advanceClusterTime

the first call to advanceClusterTime will wait until all threads are sleeping. If thread 1 is told to sleep for 1ms, it will enter the THREAD_SLEEP_TIMES_NANOS map. On the next iteration, LocalCluster.advanceClusterTime will increment the time, which should cause thread 1 to wake up. If the testing thread is "too fast", it might recheck waitForIdle and catch thread 1 before it manages to wake up and remove itself from THREAD_SLEEP_TIMES_NANOS. This causes LocalCluster.advanceClusterTime to return before the cluster is actually idle, because thread 1 will wake up and start running again after advanceClusterTime returns.

The effect is you can't be sure that all threads are actually idle once LocalCluster.advanceClusterTime returns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK +1

@asfgit asfgit merged commit e5ca0c9 into apache:master Jun 27, 2018
asfgit pushed a commit that referenced this pull request Jun 27, 2018
…-3120

STORM-3120: Clean up leftover null checks in Time, ensure idle thread

This closes #2734
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants