
Concurrency changes, primarily around thread interrupted status and pooling #4794

Merged
merged 3 commits into eclipse:master from jentfoo:concurrencyChanges on Mar 22, 2018

Conversation

@jentfoo commented Mar 13, 2018

This PR is designed to address issues #4766 and #4781. The commit messages should provide fairly detailed descriptions of what I changed.

Let me know what you think and what you would like to see changed for this PR. I think this should be fairly safe, though I can't seem to get the compile script to fully work (it fails on CUDA for me). I do have some ideas for some minor potential performance improvements, but I wanted to make sure I get my hello-world PR in first.

jentfoo added 3 commits Mar 13, 2018
Correctly handle interrupted thread states
This resolves #4781 by restoring the thread's interrupted status after an InterruptedException is caught.
This is important so that future blocking calls interrupt immediately, allowing things like JVM shutdown to happen gracefully.

In addition, while investigating this it looked like `LockSupport.parkNanos` was being used in an unsafe manner: it may return immediately if the thread is marked as interrupted, causing tight loops.

In many cases both problems are solved with a new class, `ThreadUtils`, which uses `LockSupport.parkNanos` to provide sleep-like behavior without a checked InterruptedException. This makes the handling of interrupted state more uniform.
In other cases the sleep/polling behavior was changed to wait/notify. I think this is often better than polling, but not every call site was updated.
Places where we were parking for less than 1 millisecond were updated to use 1 millisecond as a minimum time resolution.
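
To make the above concrete, here is a minimal sketch of the two patterns described: restoring the interrupted status in catch blocks, and a `parkNanos`-based sleep that throws no checked exception. The class and method names below are illustrative only; this is not the actual `ThreadUtils` added in this PR.

```java
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch only; not the actual ThreadUtils class added in this PR.
public final class InterruptHandlingSketch {

    // Sleep without a checked InterruptedException. LockSupport.parkNanos may return
    // immediately when the thread's interrupted flag is set, so we loop on the remaining
    // time and return early (leaving the flag set) once an interrupt is observed.
    public static void uncheckedSleep(long sleepTimeMillis) {
        long deadline = System.nanoTime() + sleepTimeMillis * 1_000_000L;
        long remaining;
        while ((remaining = deadline - System.nanoTime()) > 0) {
            if (Thread.currentThread().isInterrupted()) {
                return; // interrupted status stays set for the caller to act on
            }
            LockSupport.parkNanos(remaining);
        }
    }

    // The companion fix for catch blocks: restore the interrupted status rather than
    // swallowing it, so later blocking calls (and JVM shutdown) see the interrupt.
    public static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private InterruptHandlingSketch() { }
}
```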
Resolve thread leaks by switching to threadly
This resolves #4766 as well as a couple of other thread leaks I found.
Threadly pools perform well, are robust (often with stronger guarantees than java.util.concurrent), and in this case won't leak threads if garbage collected.
Not all cases updated in this commit were leaks, but I did update several places to maintain consistency in the project. I think in the future we may want to make the lifecycle of these schedulers a bit cleaner, but this provides an immediate win.
`ExecutorService` was updated to the more generic `Executor` where the executor is only needed for execution (and its lifecycle is not managed).

By default threadly schedulers use daemon threads, so in many cases the code was simplified as well.
Threadly's latest version at the time of this commit is 5.14, which offers further performance gains and features but requires Java 8, so this commit depends on 4.10.0, which works with Java 6+.
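
The `ExecutorService` to `Executor` narrowing mentioned above can be sketched as follows. The class and method names here are hypothetical, not taken from this PR; the point is that a component which only submits work accepts the narrower interface, making it clear it does not own shutdown or lifecycle.

```java
import java.util.concurrent.Executor;

// Hypothetical consumer class; names are illustrative and not taken from the PR.
public class AsyncPrefetcher {

    // Only execute() is needed here, so accept Executor rather than ExecutorService:
    // this class submits work but never calls shutdown()/awaitTermination().
    private final Executor executor;

    public AsyncPrefetcher(Executor executor) {
        this.executor = executor;
    }

    public void prefetch(Runnable fetchTask) {
        executor.execute(fetchTask);
    }
}
```

Any pool that exposes `java.util.concurrent.Executor` can then be passed in by whoever owns its lifecycle.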
Deprecated ConcurrentHashSet
Deprecated `ConcurrentHashSet` to be replaced by `Collections.newSetFromMap(new ConcurrentHashMap<>())`.
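
For reference, the JDK-standard replacement reads as follows (a minimal, self-contained example; the element type is arbitrary):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetExample {
    public static void main(String[] args) {
        // A thread-safe Set view over a ConcurrentHashMap, the standard JDK
        // replacement for a dedicated ConcurrentHashSet class.
        Set<String> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
        set.add("example");
        System.out.println(set.contains("example")); // true
    }
}
```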

@jentfoo force-pushed the jentfoo:concurrencyChanges branch from 67e02de to ffee69b on Mar 13, 2018

@saudet requested a review from @AlexDBlack on Mar 14, 2018

@saudet (Member) commented Mar 14, 2018

Looks good to me! Thanks. Have you run the tests though to make sure nothing breaks? Just the CPU ones with -Ptest-nd4j-native is fine. There might be some that don't run on your machine anyway, so don't worry too much about those. :)

@AlexDBlack (Contributor) left a comment

Overall, mostly looks reasonable to me. I'd want @raver119 to review this also. And definitely need to check all unit tests too.

@@ -385,7 +384,7 @@ public void remove() {
     private BlockingQueue<DataSet> queue;
     private DataSetIterator iterator;
     private DataSet terminator;
-    private AtomicBoolean isShutdown = new AtomicBoolean(false);
+    private boolean isShutdown = false; // locked around `this`

@AlexDBlack (Contributor) commented Mar 14, 2018

This doesn't seem like a great idea to me?
We've gone from a lock-free, thread-safe implementation to something that's neither lock free nor strictly thread safe? i.e., we need the volatile keyword here, at the very least?

@jentfoo (Author) commented Mar 14, 2018

I am glad you brought this up. There are a LOT of polling designs in this code base; this is an example of where I switched a polling design to wait/notify.

First, this is correct: there is no need for the volatile keyword because the field is only updated or read inside a synchronized (this) block. If this variable needed to be read without synchronization, then we could add volatile to achieve that, but right now it's not necessary.

As for going from no locking to locking, you are sort of correct. The synchronized blocks are held very briefly: only while updating the value and sending the notify call, or while checking the value and waking up from the wait. In this specific case we will only have one thread doing the update, and likely only one checking it, but even assuming more, this should still perform better than before.

Where this would be problematic is if we had a lot of threads waiting: when the notify occurs, those threads would need to acquire the lock one by one as they wake up, then release it. So I would not suggest changing to wait/notify where there will be a lot of waiters or high thread contention; I just don't think that is the case here.

What's wrong with polling, you might ask? Several things, but the biggest in my mind is the context switching that must occur. The thread must wake up, load its state into the CPU cache, and check a small boolean before it must park itself again. The more threads you have doing this, the more thrashing occurs.

In short, no solution is right for all cases. I would like to provide future PRs with benchmarks to prove performance gains, but these seemed like some obvious and safe points to start those changes.

Let me know if you have any more questions or thoughts about this.
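
For concreteness, the wait/notify pattern described in this comment looks roughly like the sketch below. The class and field names are illustrative, not the PR's actual code; the key points are that `isShutdown` is only touched while holding the monitor (so no volatile is needed) and that waiters loop on the condition to guard against spurious wakeups.

```java
// Illustrative sketch of the wait/notify pattern discussed above; not the PR's actual code.
public class ShutdownSignal {

    private boolean isShutdown = false; // only read/written while holding the lock on `this`

    // Called (here, by a single producer thread) to flip the flag and wake any waiters.
    public synchronized void shutdown() {
        isShutdown = true;
        notifyAll();
    }

    // Blocks without polling until shutdown has been signalled.
    public synchronized void awaitShutdown() throws InterruptedException {
        while (!isShutdown) { // loop guards against spurious wakeups
            wait();
        }
    }

    // Non-blocking check, still under the monitor so no volatile is required.
    public synchronized boolean isShutdown() {
        return isShutdown;
    }
}
```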

@AlexDBlack (Contributor) commented Mar 15, 2018

Thanks for the explanation, I think I'm happy with that. 👍
Thinking about this more, given we really only have 2 threads operating here, we shouldn't be too worried about performance and lock contention anyway. And, as you say, context switching is a potential (likely?) performance issue with the current design.

@@ -312,8 +310,7 @@ protected void externalCall() {
     private BlockingQueue<MultiDataSet> queue;
     private MultiDataSetIterator iterator;
     private MultiDataSet terminator;
-    private AtomicBoolean isShutdown = new AtomicBoolean(false);
-    private AtomicLong internalCounter = new AtomicLong(0);
+    private boolean isShutdown = false; // locked around `this`

@AlexDBlack (Contributor) commented Mar 14, 2018

As per previous comment (re: volatile).

@@ -165,7 +165,7 @@ public void registerConsumers(int numConsumers) {
     log.info("Master thread locks at RC");

     while (registered.get()) {
-        LockSupport.parkNanos(100L);
+        ThreadUtils.uncheckedSleep(1);

@AlexDBlack (Contributor) commented Mar 14, 2018

Not sure it matters too much in practice, but we are changing the wait time by a few orders of magnitude here...

@jentfoo (Author) commented Mar 14, 2018

I would argue this is better (I guess that's obvious since I provided the PR).

Few OSes / schedulers will be able to park a thread for only 100 nanoseconds. IME, on Linux you are looking at about 1000 nanoseconds at the least and 1 millisecond at the most; on Windows, 1 ms tends to be optimistic.

I tend to target 1 millisecond because that provides consistent behavior. In addition, by waking up more frequently you are compounding the context-switching / cache-miss problem that I described above.
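
A throwaway measurement (not part of the PR) makes the point about timer resolution easy to verify: ask `parkNanos` for 100 ns and time how long the thread actually stays parked.

```java
import java.util.concurrent.locks.LockSupport;

// Throwaway measurement, not part of the PR: see how long a 100 ns park request
// actually blocks on the current OS / scheduler.
public class ParkResolution {
    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(100L); // request a 100 ns park
            long elapsed = System.nanoTime() - start;
            // Typically prints microseconds to a millisecond or more, well above 100 ns.
            System.out.println("requested 100 ns, parked for " + elapsed + " ns");
        }
    }
}
```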

@jentfoo (Author) commented Mar 14, 2018

> Just the CPU ones with -Ptest-nd4j-native is fine.

Do you have any docs on how to run these? I have been using the .sh files in the GitHub repo, but they fail once they hit CUDA. I was also unsure whether they are just compiling or running tests too.

@saudet (Member) commented Mar 15, 2018

Since you don't need to build ND4J, try just using the snapshots: https://deeplearning4j.org/snapshots
Then you'll just need to run mvn -Ptest-nd4j-native ... inside deeplearning4j. You'll probably need to clone and install https://github.com/deeplearning4j/dl4j-test-resources too (issue #3501), but that should be about it.

@AlexDBlack (Contributor) left a comment

I'm happy with this - probably not a bad idea for @raver119 to review also before merging, given his work on the async iterators and NLP functionality.

@jentfoo (Author) commented Mar 16, 2018

@saudet I got the needed dependencies without too much trouble. However, the master unit tests are not passing on my machine:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project nearestneighbor-core: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: java.lang.RuntimeException:  100, 1, 0, 1, 99]: For input string: " 99]" -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project nearestneighbor-core: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: java.lang.RuntimeException:  100, 1, 0, 1, 99]
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
	at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
	at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
	at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
	at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
	at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
	at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
	at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
	at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
	at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
	at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
	at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: java.lang.RuntimeException:  100, 1, 0, 1, 99]
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
	at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
	... 20 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException:  100, 1, 0, 1, 99]
	at org.apache.maven.plugin.surefire.booterclient.output.ThreadedStreamConsumer.close(ThreadedStreamConsumer.java:127)
	at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:569)
	at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:460)
	at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:229)
	at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:201)
	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1026)
	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:862)
	at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:755)
	at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
	... 21 more
Caused by: java.lang.RuntimeException:  100, 1, 0, 1, 99]
	at org.apache.maven.plugin.surefire.booterclient.output.ForkClient.createReportEntry(ForkClient.java:286)
	at org.apache.maven.plugin.surefire.booterclient.output.ForkClient.processLine(ForkClient.java:152)
	at org.apache.maven.plugin.surefire.booterclient.output.ForkClient.consumeLine(ForkClient.java:115)
	at org.apache.maven.plugin.surefire.booterclient.output.ThreadedStreamConsumer$Pumper.run(ThreadedStreamConsumer.java:70)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: " 99]"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:569)
	at java.lang.Integer.valueOf(Integer.java:740)
	at java.lang.Integer.decode(Integer.java:1197)
	at org.apache.maven.plugin.surefire.booterclient.output.ForkClient.createReportEntry(ForkClient.java:278)
	... 4 more

Any thoughts as to what I might be missing?

I am guessing that, since it passed on CI, this is just an issue with my setup, but I want to make sure my local tests work before I attempt more changes.

@saudet (Member) commented Mar 17, 2018

Some tests are flaky, so if the exact same tests fail with and without your patch, we'll be reasonably sure it's not breaking anything :) We can ignore failing tests and continue testing with mvn -Ptest-nd4j-native -Dmaven.test.failure.ignore=true .... The CI server isn't in good enough shape to run the tests for us yet, unfortunately. :(

@jentfoo (Author) commented Mar 18, 2018

@saudet I tried that, but it is still failing. If you look at the stack trace I don't even see deeplearning4j in it, so I am not sure what the issue is.

@AlexDBlack (Contributor) commented Mar 19, 2018

Hm... maybe I'll run the tests manually on this branch/PR (vs. master) and make a determination. There are a small number of known issues with some tests on master.

@AlexDBlack (Contributor) commented Mar 22, 2018

Apologies for not getting back to this sooner. I've run some of the tests manually and compared vs. master... not everything is passing on master, but I'm not seeing any new failures on this branch/PR - so I'll merge it.
Thanks again, @jentfoo!

@AlexDBlack merged commit 2a4491d into eclipse:master on Mar 22, 2018

0 of 2 checks passed

continuous-integration/jenkins/pr-merge: This commit cannot be built
codeclimate: 5 issues to fix
@jentfoo (Author) commented Mar 23, 2018

@AlexDBlack No worries at all! Thank you guys, I am sure I will have some more PRs in the future :)

@jentfoo deleted the jentfoo:concurrencyChanges branch on Mar 23, 2018
