STORM-2018: Supervisor V2 #1697

Merged
merged 11 commits into from Oct 27, 2016

Contributor

@revans2 revans2 commented Sep 20, 2016

Still need to do some more manual testing but the unit tests passed for me.

<source>1.7</source>
<target>1.7</target>
<source>1.8</source>
<target>1.8</target>
Contributor Author

Oops forgot to revert this, will go back to 1.7

Contributor Author

DONE

@revans2
Contributor Author

revans2 commented Sep 21, 2016

I ran the same set of manual tests as before, but I now want to wait for #1699 to go into master, and then I will pull it in here. We are in the process of rolling out essentially this same patch to staging at Yahoo, and plan to roll it out to production shortly too. If others are uncomfortable about merging this into the 1.x line, I am happy to wait until we have it in production.

@revans2
Contributor Author

revans2 commented Sep 21, 2016

We also found #1700 so once that goes in I'll pull it in here too.

Robert (Bobby) Evans and others added 2 commits September 23, 2016 12:58
…o STORM-2117

STORM-2117: Supervisor V2 with local mode extracts resources directory to the wrong directory
… into STORM-2110

Conflicts:
	storm-core/test/jvm/org/apache/storm/daemon/supervisor/BasicContainerTest.java
@revans2
Contributor Author

revans2 commented Sep 23, 2016

Still have #1699 and #1712 to backport before this is ready

@HeartSaVioR
Contributor

@revans2 Relevant PRs (#1699 #1700 #1705 #1712) are all merged to master. Please pull them here.

@revans2
Contributor Author

revans2 commented Sep 26, 2016

Just pulled in the latest set of bug fixes from master. All known issues have been addressed and we have been running in staging with various versions of this patch for over a week now. Expect to roll out to production fairly soon.

@revans2
Contributor Author

revans2 commented Sep 26, 2016

The test failures look unrelated. Some are rat failures caused by test logs not being excluded.

@HeartSaVioR
Contributor

I just cherry-picked the commit that excludes logs from RAT. It was merged to master but was part of the port work, so it was never ported back.

@HeartSaVioR
Contributor

While running the build in storm-core I found that a null/storm-local directory is created in storm-core. Maybe there's a case where the base path is set to null.
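
A minimal sketch of how a directory literally named `null` can appear: in Java, concatenating a null `String` reference with a path suffix produces the text "null/storm-local" instead of failing. The class and method names below are hypothetical and are not taken from the Storm codebase; whether this is what actually happens in storm-core is an assumption.

```java
import java.util.Objects;

// Hypothetical illustration only; not Storm code.
final class NullBasePathDemo {
    // Naive concatenation: a null base path silently becomes the text "null".
    static String naiveLocalDir(String basePath) {
        return basePath + "/storm-local";
    }

    // Defensive variant: fail fast when the base path was never configured.
    static String checkedLocalDir(String basePath) {
        Objects.requireNonNull(basePath, "base path must be set");
        return basePath + "/storm-local";
    }

    public static void main(String[] args) {
        System.out.println(naiveLocalDir(null)); // prints "null/storm-local"
        try {
            checkedLocalDir(null);
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast, as in `checkedLocalDir`, would turn a stray directory into an immediate error that points at the unset configuration value.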

@@ -0,0 +1,126 @@
/**
Contributor

This file seems to clash with healthcheck.clj.

2016-10-01 11:53:24.105 timer o.a.s.d.s.DefaultUncaughtExceptionHandler [ERROR] Error when processing event
java.lang.NoClassDefFoundError: org/apache/storm/command/HealthCheck (wrong name: org/apache/storm/command/healthcheck)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.storm.daemon.supervisor.timer.SupervisorHealthCheck.run(SupervisorHealthCheck.java:36)
    at org.apache.storm.StormTimer$1.run(StormTimer.java:190)
    at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:83)

@HeartSaVioR
Contributor

I found two issues, but other than that the manual tests passed. Code review was already done in the PR for the master branch. +1 once these are resolved.

@HeartSaVioR
Contributor

@revans2
It would be better to address STORM-2131 here as well. Please pull #1724 here.
And could you update the pull request according to the review comments? Supervisor V2 is the one I want to include in 1.1.0.

Thanks in advance.

@HeartSaVioR
Contributor

@revans2 Do you have any updates on this? I'm occasionally seeing Supervisor failures so would like to get this merged to 1.x, and even 1.0.x.

@revans2
Contributor Author

revans2 commented Oct 25, 2016

@HeartSaVioR Sorry this has taken so long. I am going to upmerge this and pull in #1724 next.

Conflicts:
	storm-core/src/clj/org/apache/storm/pacemaker/pacemaker_state_factory.clj
@revans2
Contributor Author

revans2 commented Oct 25, 2016

Just pushed the upmerged code. Will look into pulling in #1724 too.

@revans2
Contributor Author

revans2 commented Oct 25, 2016

Merged in #1724 now too (it was a trivial cherry pick).

@HeartSaVioR if you want to take a look this should be good for merging in.

Just as an FYI we have been running with a version of this in production for a little while now with no real issues.

Once this goes in if you want me to I can take a look at pulling it back to 1.0.x too.

@HeartSaVioR
Contributor

@revans2 Shouldn't healthcheck.clj be deleted? At least for me, HealthCheck.java clashes with healthcheck.clj. I can't say exactly why; it might be an issue specific to OSX, but either way there's a problem. I left a comment about this.
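
One possible explanation, offered as an assumption rather than a confirmed diagnosis: on a case-insensitive filesystem, two compiled classes whose names differ only by letter case (here `HealthCheck` and `healthcheck`, as named in the `NoClassDefFoundError` above) map to the same file, so one overwrites the other. The sketch below is a hypothetical checker, not part of Storm, that groups resource paths by their lower-cased form and reports any group with more than one member.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Storm.
final class CaseClashCheck {
    // Returns groups of paths that are identical once letter case is ignored.
    static List<List<String>> findCaseClashes(Collection<String> paths) {
        Map<String, List<String>> byLower = new TreeMap<>();
        for (String p : paths) {
            byLower.computeIfAbsent(p.toLowerCase(Locale.ROOT), k -> new ArrayList<>()).add(p);
        }
        List<List<String>> clashes = new ArrayList<>();
        for (List<String> group : byLower.values()) {
            if (group.size() > 1) {
                clashes.add(group);
            }
        }
        return clashes;
    }

    public static void main(String[] args) {
        // Class file names inferred from the stack trace; treat them as assumptions.
        List<String> paths = Arrays.asList(
            "org/apache/storm/command/HealthCheck.class",
            "org/apache/storm/command/healthcheck.class");
        System.out.println(findCaseClashes(paths));
    }
}
```

A check like this could run over the build output directory listing to flag such pairs before they surface as class-loading errors on some platforms only.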

@revans2
Contributor Author

revans2 commented Oct 26, 2016

@HeartSaVioR good catch, I thought I had deleted it.

@srdo
Contributor

srdo commented Oct 26, 2016

I seem to be getting a few new errors when running some of our own unit tests with this branch. The exceptions are intermittent.

java.lang.NullPointerException
    at org.apache.storm.utils.DisruptorQueue$FlusherPool.stop(DisruptorQueue.java:110)
    at org.apache.storm.utils.DisruptorQueue$Flusher.close(DisruptorQueue.java:293)
    at org.apache.storm.utils.DisruptorQueue.haltWithInterrupt(DisruptorQueue.java:410)
    at org.apache.storm.disruptor$halt_with_interrupt_BANG_.invoke(disruptor.clj:77)
    at org.apache.storm.daemon.executor$mk_executor$reify__4923.shutdown(executor.clj:412)
    at sun.reflect.GeneratedMethodAccessor303.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
    at org.apache.storm.daemon.worker$fn__5550$exec_fn__1372__auto__$reify__5552$shutdown_STAR___5572.invoke(worker.clj:668)
    at org.apache.storm.daemon.worker$fn__5550$exec_fn__1372__auto__$reify$reify__5598.shutdown(worker.clj:706)
    at org.apache.storm.ProcessSimulator.killProcess(ProcessSimulator.java:66)
    at org.apache.storm.ProcessSimulator.killAllProcesses(ProcessSimulator.java:79)
    at org.apache.storm.testing$kill_local_storm_cluster.invoke(testing.clj:207)
    at org.apache.storm.testing4j$_withLocalCluster.invoke(testing4j.clj:93)
    at org.apache.storm.Testing.withLocalCluster(Unknown Source)

and this kind of error

java.lang.IllegalStateException: Timer is not active
    at org.apache.storm.timer$check_active_BANG_.invoke(timer.clj:87)
    at org.apache.storm.timer$cancel_timer.invoke(timer.clj:120)
    at org.apache.storm.daemon.worker$fn__5550$exec_fn__1372__auto__$reify__5552$shutdown_STAR___5572.invoke(worker.clj:682)
    at org.apache.storm.daemon.worker$fn__5550$exec_fn__1372__auto__$reify$reify__5598.shutdown(worker.clj:706)
    at org.apache.storm.ProcessSimulator.killProcess(ProcessSimulator.java:66)
    at org.apache.storm.ProcessSimulator.killAllProcesses(ProcessSimulator.java:79)
    at org.apache.storm.testing$kill_local_storm_cluster.invoke(testing.clj:207)
    at org.apache.storm.testing4j$_withLocalCluster.invoke(testing4j.clj:93)
    at org.apache.storm.Testing.withLocalCluster(Unknown Source)

Our tests are running Storm in local mode with no time simulation. I've tried running the same tests on 1.x-branch, and these don't seem to occur there.

@revans2
Contributor Author

revans2 commented Oct 26, 2016

OK, so going through the code, in both cases it looks like the only way that can happen is if the worker is somehow being shut down multiple times. My guess is that because the slots are on different threads, there is now a race between shutting down a worker through the slot and shutting down the worker through the cluster shutdown.

I'll look into reproducing it. @srdo is there any way you can share your test case with us? It would make reproducing and fixing it a lot simpler.
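
To illustrate the suspected double shutdown, here is a minimal sketch of one common guard: an `AtomicBoolean` flipped with `compareAndSet`, so that only the first caller runs the cleanup and later callers return immediately. The class is hypothetical and is not the fix applied in Storm; the counter merely stands in for real cleanup work such as cancelling timers or halting queues.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical worker stand-in; not Storm code.
final class IdempotentShutdown {
    private final AtomicBoolean shutDown = new AtomicBoolean(false);
    private final AtomicInteger cleanups = new AtomicInteger();

    // Returns true only for the single caller that actually performed the cleanup.
    boolean shutdown() {
        if (!shutDown.compareAndSet(false, true)) {
            return false; // another thread already shut this worker down
        }
        cleanups.incrementAndGet(); // placeholder for cancelling timers, halting queues
        return true;
    }

    int cleanupCount() {
        return cleanups.get();
    }

    public static void main(String[] args) throws InterruptedException {
        IdempotentShutdown worker = new IdempotentShutdown();
        // Two threads model the slot and the cluster both requesting shutdown.
        Thread slot = new Thread(worker::shutdown);
        Thread cluster = new Thread(worker::shutdown);
        slot.start();
        cluster.start();
        slot.join();
        cluster.join();
        System.out.println("cleanups run: " + worker.cleanupCount()); // prints "cleanups run: 1"
    }
}
```

With a guard like this, a second shutdown request becomes a no-op instead of touching already-released resources, which is the kind of access that could produce errors like the `NullPointerException` and "Timer is not active" traces above.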

@HeartSaVioR
Contributor

@revans2 @srdo
I'm even OK with filing an issue for the intermittent race condition in the local cluster and merging this now, since the Supervisor race condition in 1.x is much more critical. It even occurs in clustered environments.

@srdo
Contributor

srdo commented Oct 27, 2016

@revans2 I can't share our actual test code since it depends on pretty large chunks of our codebase. I'll try reproducing with an example topology.

I'd be fine with filing a separate issue to fix the race so this PR isn't blocked.

@revans2
Contributor Author

revans2 commented Oct 27, 2016

@srdo sounds good. I filed https://issues.apache.org/jira/browse/STORM-2175 to address the race condition.

@HeartSaVioR
Contributor

OK. Unit tests, integration tests, and manual tests all passed on the recent commits.
I'm +1 and will merge this now, since there are no more reviewers and this has been open for a fairly long time (more than a month) even though it is just a backport.

Thanks for the amazing work.

Btw, I'm in favor of just having 1.1.0 and not having 1.0.3 unless there's a specific request for it. If you would be OK with porting back on demand, we could skip it.

@asfgit asfgit merged commit df33efd into apache:1.x-branch Oct 27, 2016
5 participants