New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-2018: Supervisor V2 #1697
Conversation
<source>1.7</source> | ||
<target>1.7</target> | ||
<source>1.8</source> | ||
<target>1.8</target> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops forgot to revert this, will go back to 1.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
I ran the same set of manual tests as before, but I now want to wait on #1699 to go into master, and then I will pull it in here. We are in the process of rolling essentially what is this same patch out to staging at Yahoo, and plan to roll it out to production shortly too. If others are feeling uncomfortable about merging this into the 1.x line I am happy to wait until we have it in production. |
We also found #1700 so once that goes in I'll pull it in here too. |
…o STORM-2117 STROM-2117: Supervisor V2 with local mode extracts resources directory to the wrong directory
… into STORM-2110 Conflicts: storm-core/test/jvm/org/apache/storm/daemon/supervisor/BasicContainerTest.java
… into STORM-2109
… into STORM-2122
Just pulled in the latest set of bug fixes from master. All known issues have been addressed and we have been running in staging with various versions of this patch for over a week now. Expect to roll out to production fairly soon. |
The test failures look unrelated. Some are rat failures caused by test logs not being excluded. |
I just cherry-picked commit which excludes logs from RAT. It's merged to master but was part of port work so didn't port back. |
While running build in storm-core I found that null/storm-local directory is created in storm-core. Maybe there's a case base path is set to null. |
@@ -0,0 +1,126 @@ | |||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file seems to be clashed with healthcheck.clj.
2016-10-01 11:53:24.105 timer o.a.s.d.s.DefaultUncaughtExceptionHandler [ERROR] Error when processing event
java.lang.NoClassDefFoundError: org/apache/storm/command/HealthCheck (wrong name: org/apache/storm/command/healthcheck)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.storm.daemon.supervisor.timer.SupervisorHealthCheck.run(SupervisorHealthCheck.java:36)
at org.apache.storm.StormTimer$1.run(StormTimer.java:190)
at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:83)
I found two issues, but other than that manual tests passed. Code review is already done from PR for master branch. +1 once these are resolved. |
@revans2 Do you have any updates on this? I'm occasionally seeing Supervisor failures so would like to get this merged to 1.x, and even 1.0.x. |
@HeartSaVioR Sorry this has taken so long. I am going to upmerge this and pull in #1724 next. |
Conflicts: storm-core/src/clj/org/apache/storm/pacemaker/pacemaker_state_factory.clj
Just pushed the upmerged code. Will look into pulling in #1724 too. |
Merged in #1724 now too (it was a trivial cherry pick). @HeartSaVioR if you want to take a look this should be good for merging in. Just as an FYI we have been running with a version of this in production for a little while now with no real issues. Once this goes in if you want me to I can take a look at pulling it back to 1.0.x too. |
@revans2 Shouldn't healthcheck.clj be deleted? At least for me HealthCheck.java clashes with healthcheck.clj. I can't clearly say why, might be specific issue with OSX, but anyway there's an issue. I left a comment regarding this. |
@HeartSaVioR good catch, I thought I had deleted it. |
I seem to be getting a few new errors when running some of our own unit tests with this branch. The exceptions are intermittent.
and this kind of error
Our tests are running Storm in local mode with no time simulation. I've tried running the same tests on 1.x-branch, and these don't seem to occur there. |
OK so going through the code in both cases it looks like the only way that can happen is if the workers is somehow being shut down multiple times. My guess is that because the slots are on different threads there is a race now between shutting down a worker through the slot and shutting down the worker through the cluster shutting down. I'll look into reproducing it. @srdo is there any way you can share your test case with us? It would make my job a lot simpler of trying to reproduce and fix it. |
@revans2 I can't share our actual test code since it depends on pretty large chunks of our codebase. I'll try reproducing with an example topology. I'd be fine with filing a separate issue to fix the race so this PR isn't blocked. |
@srdo sounds good. I filed https://issues.apache.org/jira/browse/STORM-2175 to address the race condition. |
OK. Unit and integration tests, and manual tests passed on recent commits. Thanks for the amazing work. Btw, I'm in favor of just having 1.1.0, not having 1.0.3 unless there's specific request on it. If you would be OK to port back on demand, we could skip it. |
Still need to do some more manual testing but the unit tests passed for me.